Name: Soh Hong Yu
Admin Number: P2100775
Class: DAAA/FT/2A/01
Module Code: ST1511 AI and Machine Learning
Use a time series model to predict and forecast the amount of air pollutants in open air
The dataset appears to be derived from the Air Quality dataset in the UCI Machine Learning Repository, as some of its values match that dataset. Based on the dataset information, we can assume the data was collected by 4 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city.
# Basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import seaborn as sns
import warnings
from itertools import product
from pathlib import Path
# Metrics
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_squared_error
# Models
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.statespace.varmax import VARMAX
# EDA + Model Selection
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.stattools import adfuller
# Pickle Library = Saving Models
import pickle
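As a quick reference for the two evaluation metrics imported above, here is a toy example with made-up numbers (not model output): MAPE is scale-free, while RMSE is expressed in the units of the target.

```python
# Toy illustration of the two evaluation metrics imported above.
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mape = mean_absolute_percentage_error(y_true, y_pred)  # mean of |error| / |true|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))     # same units as y_true
print(mape, rmse)
```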
df = pd.read_csv("./train.csv", sep=",")
df.head()
| | Date | T | RH | Gas | Value | Unnamed: 5 | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | 15/3/2016 | 12.020833 | 54.883334 | CO | 1053.200000 | NaN | NaN |
| 1 | 16/3/2016 | 9.833333 | 64.069791 | CO | 995.250000 | NaN | NaN |
| 2 | 17/3/2016 | 11.292708 | 51.107292 | CO | 1025.250000 | NaN | NaN |
| 3 | 18/3/2016 | 12.866319 | 51.530903 | CO | 1064.444444 | NaN | NaN |
| 4 | 19/3/2016 | 16.016667 | 48.843750 | CO | 1088.741667 | NaN | NaN |
We will begin by conducting an exploratory data analysis of the data, to gain a better understanding of the characteristics of the dataset.
As noted, the data is derived from the Air Quality dataset in the UCI Machine Learning Repository; it contains 1312 data points across 7 columns.
Date - date
T - temperature
RH - relative humidity
Gas - type of pollution gas
Value - amount of pollution
To prevent mutation of our original data, we will make a copy of it and perform EDA on the copy.
df_eda = df.copy()
df_eda.shape
(1312, 7)
There are 1312 rows and 7 columns in the entire data set.
df_eda.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1312 entries, 0 to 1311
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Date        1312 non-null   object
 1   T           1312 non-null   float64
 2   RH          1312 non-null   float64
 3   Gas         1312 non-null   object
 4   Value       1312 non-null   float64
 5   Unnamed: 5  0 non-null      float64
 6   Unnamed: 6  0 non-null      float64
dtypes: float64(5), object(2)
memory usage: 71.9+ KB
Observations
descriptive_stats = df_eda.describe(include="all").T
descriptive_stats["Proportion of Most Frequent Value"] = (
descriptive_stats["freq"] / len(df_eda) * 100
)
descriptive_stats.sort_values("Proportion of Most Frequent Value", ascending=False)
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | Proportion of Most Frequent Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gas | 1312 | 4 | CO | 328 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 25.0 |
| Date | 1312 | 328 | 15/3/2016 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.304878 |
| T | 1312.0 | NaN | NaN | NaN | 11.634917 | 37.041779 | -200.0 | 12.38776 | 18.832292 | 25.557118 | 32.997917 | NaN |
| RH | 1312.0 | NaN | NaN | NaN | 39.873729 | 43.217836 | -200.0 | 36.348177 | 46.078993 | 57.932292 | 81.104167 | NaN |
| Value | 1312.0 | NaN | NaN | NaN | 750.218839 | 225.520132 | -160.0 | 648.370833 | 769.370833 | 880.570833 | 1436.233333 | NaN |
| Unnamed: 5 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Unnamed: 6 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Observations
# converting date to a datetime object, then set as index
df_eda["Date"] = pd.to_datetime(df_eda["Date"], format="%d/%m/%Y")
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df_eda.set_index("Date", inplace=True)
df.set_index("Date", inplace=True)
# dropping useless columns
df_eda = df_eda.loc[:, ~df_eda.columns.str.contains("^Unnamed")]
df = df.loc[:, ~df.columns.str.contains("^Unnamed")]
display(df)
| | T | RH | Gas | Value |
|---|---|---|---|---|
| Date | | | | |
| 2016-03-15 | 12.020833 | 54.883334 | CO | 1053.200000 |
| 2016-03-16 | 9.833333 | 64.069791 | CO | 995.250000 |
| 2016-03-17 | 11.292708 | 51.107292 | CO | 1025.250000 |
| 2016-03-18 | 12.866319 | 51.530903 | CO | 1064.444444 |
| 2016-03-19 | 16.016667 | 48.843750 | CO | 1088.741667 |
| ... | ... | ... | ... | ... |
| 2017-02-01 | 5.267708 | 39.614930 | O3 | 553.180556 |
| 2017-02-02 | -55.515972 | -24.010417 | O3 | 343.500000 |
| 2017-02-03 | -14.272917 | 28.563542 | O3 | 334.458333 |
| 2017-02-04 | 4.848611 | 37.832986 | O3 | 379.513889 |
| 2017-02-05 | 7.273958 | 31.809375 | O3 | 947.333333 |
1312 rows × 4 columns
Pandas-Profiling is a convenient tool to quickly explore the datasets along with some alerts/warnings about the dataset
prof = ProfileReport(df_eda, explorative=True)
# prof.to_notebook_iframe()
# prof.to_file(output_file='AirPollution.html')
# Summary 1
print("Summary")
print(f"1. T is strongly correlated with RH = {df_eda.corr()['T']['RH']}")
print(f"2. T is correlated with Value = {df_eda.corr()['T']['Value']}")
print(f"3. RH is correlated with Value = {df_eda.corr()['RH']['Value']}")
print(
f"4. The different gases are uniformly distributed: \n{df_eda['Gas'].value_counts()}"
)
Summary
1. T is strongly correlated with RH = 0.9033185659007723
2. T is correlated with Value = 0.663510936586675
3. RH is correlated with Value = 0.6929274128393887
4. The different gases are uniformly distributed:
CO     328
HC     328
NO2    328
O3     328
Name: Gas, dtype: int64
continuous = ["T", "RH", "Value"]
fig = plt.figure(tight_layout=True, figsize=(20, 15))
for i, column in enumerate(continuous):
ax = fig.add_subplot(len(continuous), 1, i + 1)
df_eda[column].plot(ax=ax)
ax.set_title("Time Changes of " + column, fontsize=16)
plt.show()
Summary / Observations
fig = plt.figure(tight_layout=True, figsize=(20, 15))
for i, c in enumerate(continuous):
ax = fig.add_subplot(len(continuous), 1, i + 1)
sns.lineplot(data=df_eda, x=df_eda.index, y=c, hue="Gas", ax=ax)
ax.set_title("Time Changes of " + c, fontsize=16)
plt.show()
Observations
We will begin with a univariate analysis, analysing the distribution of each variable.
We will plot and see the distribution of the different variables and how it relates to the different Gases
As mentioned, the temperature readings are identical across the different gases, so we only need to plot a single temperature distribution.
plt.figure(figsize=(10, 10))
sns.histplot(data=df_eda, x="T", kde=True)
plt.title("Distributions of Temperature")
plt.show()
Observations
Likewise, the relative humidity readings are identical across the different gases, so we only need to plot a single relative humidity distribution.
plt.figure(figsize=(10, 10))
sns.histplot(data=df_eda, x="RH", kde=True)
plt.title("Distributions of Relative Humidity")
plt.show()
Observations
As the values of the gases vary, we will look at the distribution of each gas individually.
# Value
fig, ax = plt.subplots(2, 2, figsize=(20, 20))
gasColor = ["#ff0000", "#00ff00", "#0000ff", "#ffff00"]
for gasIndex in range(df_eda["Gas"].nunique()):
ax[gasIndex // 2][gasIndex % 2].hist(
df_eda[df_eda["Gas"] == df_eda["Gas"].unique()[gasIndex]]["Value"],
density=True,
bins=50,
color=gasColor[gasIndex] + "88",
)
sns.kdeplot(
df_eda[df_eda["Gas"] == df_eda["Gas"].unique()[gasIndex]]["Value"],
lw=3,
color=gasColor[gasIndex],
ax=ax[gasIndex // 2][gasIndex % 2],
)
ax[gasIndex // 2][gasIndex % 2].set_title(
f"Distribution of Value of {df_eda['Gas'].unique()[gasIndex]}"
)
plt.suptitle("Distributions of Value of Different Gases", fontsize=18, y=0.91)
plt.show()
Observations
Before we can fix the outlier issues caused by the faulty sensor, we will have to run several tests to decompose and characterise the time series.
These tests are:
- Seasonal decomposition
- Granger causality test
- Johansen cointegration test
- Augmented Dickey-Fuller (stationarity) test
Before we start any test, we will have to reshape the data a bit.
df_pivot = df_eda.copy(deep=True)
df_pivot = df_pivot.reset_index()
df_pivot = df_pivot.pivot(index="Date", columns="Gas", values="Value").join(
df_eda[["T", "RH"]]
)
df_pivot = df_pivot[~df_pivot.index.duplicated(keep="first")]
df_pivot
| | CO | HC | NO2 | O3 | T | RH |
|---|---|---|---|---|---|---|
| Date | | | | | | |
| 2016-03-15 | 1053.200000 | 729.800000 | 933.800000 | 876.833333 | 12.020833 | 54.883334 |
| 2016-03-16 | 995.250000 | 681.441667 | 1021.750000 | 708.025000 | 9.833333 | 64.069791 |
| 2016-03-17 | 1025.250000 | 806.583333 | 881.375000 | 867.375000 | 11.292708 | 51.107292 |
| 2016-03-18 | 1064.444444 | 794.258333 | 794.527778 | 996.625000 | 12.866319 | 51.530903 |
| 2016-03-19 | 1088.741667 | 755.083333 | 800.883333 | 987.341667 | 16.016667 | 48.843750 |
| ... | ... | ... | ... | ... | ... | ... |
| 2017-02-01 | 729.422222 | 562.650000 | 797.647222 | 553.180556 | 5.267708 | 39.614930 |
| 2017-02-02 | 474.291667 | 347.480556 | 508.180556 | 343.500000 | -55.515972 | -24.010417 |
| 2017-02-03 | 615.700000 | 414.475000 | 819.733333 | 334.458333 | -14.272917 | 28.563542 |
| 2017-02-04 | 691.713889 | 458.947222 | 909.375000 | 379.513889 | 4.848611 | 37.832986 |
| 2017-02-05 | 867.600000 | 751.833333 | 673.741667 | 947.333333 | 7.273958 | 31.809375 |
328 rows × 6 columns
Seasonal decomposition is a method in time series analysis that represents a series as a sum (or, sometimes, a product) of three components: a trend, a periodic (seasonal) component, and random residuals.
In short, the plots below split the data into its seasonal component, trend component, and remainder; summing these components recovers the original data.
for i in df_pivot.columns:
plt.rc("figure", figsize=(20, 10))
print("Seasonal Decomposition\nColumn:", i)
decomposition = seasonal_decompose(df_pivot[i])
print("Seasonal", len(decomposition.seasonal.drop_duplicates()))
decomposition.plot()
plt.show()
Seasonal Decomposition
Column: CO
Seasonal 7
Seasonal Decomposition
Column: HC
Seasonal 7
Seasonal Decomposition
Column: NO2
Seasonal 7
Seasonal Decomposition
Column: O3
Seasonal 7
Seasonal Decomposition
Column: T
Seasonal 7
Seasonal Decomposition
Column: RH
Seasonal 7
Observations
We will use the Granger causality test, a statistical hypothesis test for determining whether one time series provides useful information for forecasting another. This will help us conclude whether multivariate modelling can be used for this time series.
The hypotheses of the Granger causality test are:
$H_0$: There is no causality between time series
$H_1$: There is some causality between time series
maxlag = 12
test = "ssr_chi2test"
g_matrix = pd.DataFrame(
np.zeros((len(df_pivot.columns), len(df_pivot.columns))),
columns=df_pivot.columns,
index=df_pivot.columns,
)
for c in g_matrix.columns:
for r in g_matrix.index:
test_result = grangercausalitytests(
df_pivot[[r, c]], maxlag=maxlag, verbose=False
)
        # record the smallest p-value across all tested lags
        p_values = [round(test_result[i + 1][0][test][1], 4) for i in range(maxlag)]
min_p_value = np.min(p_values)
g_matrix.loc[r, c] = min_p_value
g_matrix.columns = [var + "_x" for var in df_pivot.columns]
g_matrix.index = [var + "_y" for var in df_pivot.columns]
g_matrix
| | CO_x | HC_x | NO2_x | O3_x | T_x | RH_x |
|---|---|---|---|---|---|---|
| CO_y | 1.0000 | 0.1086 | 0.2394 | 0.2212 | 0.0350 | 0.0250 |
| HC_y | 0.0002 | 1.0000 | 0.0111 | 0.3177 | 0.0068 | 0.0020 |
| NO2_y | 0.4698 | 0.1523 | 1.0000 | 0.0833 | 0.1702 | 0.3143 |
| O3_y | 0.0049 | 0.3991 | 0.0221 | 1.0000 | 0.0140 | 0.0285 |
| T_y | 0.6076 | 0.7240 | 0.2505 | 0.5610 | 1.0000 | 0.1257 |
| RH_y | 0.2407 | 0.3673 | 0.0305 | 0.1439 | 0.0380 | 1.0000 |
plt.figure(figsize=(10, 10))
sns.heatmap(g_matrix, annot=True, cmap="Blues", vmin=0, vmax=1)
plt.title("Granger's Test for Causality")
plt.show()
Observations
Cointegration is a statistical method used to test the correlation between two or more non-stationary time series in the long run, or over a specified period. It helps identify long-run parameters or an equilibrium between two or more sets of variables.
We will use Johansen's cointegration test to establish whether there is a long-term relationship between the time series, and therefore whether there is a significant relationship to exploit with multivariate modelling.
$H_0$: There is no cointegration
$H_1$: There is some cointegration
# det_order=-1: no deterministic term; k_ar_diff=5 lagged differences
out = coint_johansen(df_pivot, -1, 5)
d = {"0.90": 0, "0.95": 1, "0.99": 2}  # column index per confidence level
traces = out.lr1  # trace statistics
cvts = out.cvt[:, d[str(1 - 0.05)]]  # critical values at the 95% level
def adjust(val, length=6):
return str(val).ljust(length)
# Summary
print("Name :: Test Stat > C(95%) => Significant \n", "--" * 20)
for col, trace, cvt in zip(df_pivot.columns, traces, cvts):
print(
adjust(col),
":: ",
adjust(round(trace, 2), 9),
">",
adjust(cvt, 8),
" => ",
trace > cvt,
)
Name :: Test Stat > C(95%) => Significant
----------------------------------------
CO     ::  151.09   > 83.9383  =>  True
HC     ::  90.83    > 60.0627  =>  True
NO2    ::  45.03    > 40.1749  =>  True
O3     ::  19.66    > 24.2761  =>  False
T      ::  4.85     > 12.3212  =>  False
RH     ::  0.64     > 4.1296   =>  False
Observations
Stationarity can be defined in precise mathematical terms, but for our purpose we mean a flat-looking series: no trend, constant variance over time, a constant autocorrelation structure over time, and no periodic fluctuations (seasonality).
This can be tested with the Augmented Dickey-Fuller test, also known as the ADF test.
$H_0$: Data is not stationary
$H_1$: Data is stationary
def adfuller_test(series, significant=0.05, name="", verbose=False):
r = adfuller(series, autolag="AIC")
output = {
"test_statistic": round(r[0], 4),
"p_value": round(r[1], 4),
"n_lags": round(r[2], 4),
"n_obs": r[3],
}
p_value = output["p_value"]
def adjust(val, length=6):
return str(val).ljust(length)
# Print Summary
print(f'Augmented Dickey-Fuller Test on "{name}"', "\n", "-" * 47)
print(f"Null Hypothesis: Data has unit root. Non-Stationary.")
print(f"Significance Level = {significant}")
print(f'Test Statistic = {output["test_statistic"]}')
print(f'No. Lags Chosen = {output["n_lags"]}')
for key, val in r[4].items():
print(f" Critical value {adjust(key)} = {round(val, 3)}")
if p_value <= significant:
print(f"=> P-Value = {p_value}. Rejecting Null Hypothesis.")
print(f"=> Series is Stationary.")
else:
print(f"=> P-Value = {p_value}. Weak evidence to reject the Null Hypothesis.")
print(f"=> Series is Non-Stationary.")
for name, column in df_pivot.items():
adfuller_test(column, name=column.name)
print("\n")
Augmented Dickey-Fuller Test on "CO"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -9.5435
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "HC"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -9.6683
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "NO2"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -10.0552
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "O3"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -9.5691
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "T"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -10.251
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "RH"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -10.2458
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.
Observations
As seen in the distribution section, the dataset has many outliers and the data is skewed. Such values should be dropped, deskewed, or interpolated. We will analyse which method is suitable, and afterwards check whether the columns are still stationary.
df_pivot.hist(bins="auto", figsize=(10, 10))
plt.show()
Observations
As the lowest temperature recorded on Earth is -89.2°C and the lowest relative humidity recorded is about 1 percent, any temperature below -89.2°C or physically impossible relative humidity reading must come from the faulty sensor and should be interpolated.
Interpolation is the process of estimating unknown values that fall between two known points. We will interpolate the data with pandas' interpolate method using the time method, which draws a straight line between the two known points (weighted by the elapsed time) and estimates the missing values along it.
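A minimal illustration of the idea on a toy series (made-up values, not the pollution data); on an evenly spaced daily index, method="time" coincides with plain linear interpolation:

```python
# Fill a two-day gap by drawing a straight line between the known endpoints.
import numpy as np
import pandas as pd

s = pd.Series(
    [10.0, np.nan, np.nan, 16.0],
    index=pd.date_range("2017-02-01", periods=4, freq="D"),
)
filled = s.interpolate(method="time")  # gap filled as 12.0 and 14.0
print(filled.round(6).tolist())
```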
df_inter = df_pivot.copy(deep=True)
df_inter.loc[df_inter["T"] < -89.2, "T"] = np.nan
df_inter.loc[df_inter["RH"] < 0, "RH"] = np.nan
df_inter = df_inter.interpolate(method="time", limit_direction="forward")
df_inter
| | CO | HC | NO2 | O3 | T | RH |
|---|---|---|---|---|---|---|
| Date | | | | | | |
| 2016-03-15 | 1053.200000 | 729.800000 | 933.800000 | 876.833333 | 12.020833 | 54.883334 |
| 2016-03-16 | 995.250000 | 681.441667 | 1021.750000 | 708.025000 | 9.833333 | 64.069791 |
| 2016-03-17 | 1025.250000 | 806.583333 | 881.375000 | 867.375000 | 11.292708 | 51.107292 |
| 2016-03-18 | 1064.444444 | 794.258333 | 794.527778 | 996.625000 | 12.866319 | 51.530903 |
| 2016-03-19 | 1088.741667 | 755.083333 | 800.883333 | 987.341667 | 16.016667 | 48.843750 |
| ... | ... | ... | ... | ... | ... | ... |
| 2017-02-01 | 729.422222 | 562.650000 | 797.647222 | 553.180556 | 5.267708 | 39.614930 |
| 2017-02-02 | 474.291667 | 347.480556 | 508.180556 | 343.500000 | -55.515972 | 34.089236 |
| 2017-02-03 | 615.700000 | 414.475000 | 819.733333 | 334.458333 | -14.272917 | 28.563542 |
| 2017-02-04 | 691.713889 | 458.947222 | 909.375000 | 379.513889 | 4.848611 | 37.832986 |
| 2017-02-05 | 867.600000 | 751.833333 | 673.741667 | 947.333333 | 7.273958 | 31.809375 |
328 rows × 6 columns
plt.title("Time Changes for both T and RH")
sns.lineplot(data=df_inter, x=df_inter.index, y="T")
sns.lineplot(data=df_inter, x=df_inter.index, y="RH")
plt.legend(
["T", "RH"],
loc="upper center",
bbox_to_anchor=(0.5, -0.1),
fancybox=True,
shadow=True,
ncol=3,
)
plt.show()
Observations
After interpolating the data, there is a slight decrease in the extremes and the dips are more physically reasonable. However, as we have manipulated the data, it might no longer be stationary; we will therefore run another ADF test.
for name, column in df_inter.items():
adfuller_test(column, name=column.name)
print("\n")
Augmented Dickey-Fuller Test on "CO"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -9.5435
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "HC"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -9.6683
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "NO2"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -10.0552
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "O3"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -9.5691
No. Lags Chosen = 1
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "T"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -8.5363
No. Lags Chosen = 0
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Augmented Dickey-Fuller Test on "RH"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -4.8428
No. Lags Chosen = 2
 Critical value 1%  = -3.451
 Critical value 5%  = -2.87
 Critical value 10% = -2.572
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.
Observations
Autocorrelation measures the relationship between a variable's present value and its past values. In that sense, autocorrelations can show us how useful previous values are for predicting the current value of each gas. We will use the ACF and PACF plots for comparison.
Autocorrelation (ACF): the degree of similarity between the time series and its lagged version.
Partial autocorrelation (PACF): the degree of similarity between the time series and its lagged version, after removing the effect of the intermediate lags.
These plots will help us interpret the p and q values for the models below.
We will also compare the interpolated data against the non-interpolated data.
for i in df_inter.columns:
fig, ax = plt.subplots(1, 2, tight_layout=True)
plot_acf(df_inter[i], ax=ax[0])
plot_pacf(df_inter[i], ax=ax[1])
fig.suptitle(i, fontsize=16)
plt.show()
Observations
for i in df_pivot.columns:
fig, ax = plt.subplots(1, 2, tight_layout=True)
plot_acf(df_pivot[i], ax=ax[0])
plot_pacf(df_pivot[i], ax=ax[1])
fig.suptitle(i, fontsize=16)
plt.show()
Observations
These values will help us choose the number of AR (p) and MA (q) terms to train our models:
CO (p,q) = (1,2)
HC (p,q) = (1,2)
NO2 (p,q) = (1,2)
O3 (p,q) = (1,2)
There are many models for time series. We will use three models and compare them.
ARIMA vs SARIMA: the Autoregressive Integrated Moving Average (ARIMA) is a forecasting algorithm that predicts future values from past values through its autoregressive and moving-average components. The Seasonal ARIMA (SARIMA) is very similar, but additionally takes the seasonal patterns in the time series into account.
ARIMA & SARIMA vs VARMAX: for a stationary time series, an ARIMA/SARIMA forecast is essentially a linear equation (like a linear regression) in past values and errors, whereas VARMAX models the dynamic relationships both among the dependent variables and between the dependent and independent (exogenous) variables. VARMAX models are defined by the orders of the autoregressive and/or moving-average processes.
In short, ARIMA and SARIMA are used for univariate data while VARMAX handles multivariate data; SARIMA accounts for seasonality and ARIMA does not.
Before we start building the models, we need to make sure the dates in our data have been converted to daily periods and all NA values are removed.
df_model = df_pivot.copy().dropna().to_period("d")
df_model
| | CO | HC | NO2 | O3 | T | RH |
|---|---|---|---|---|---|---|
| Date | | | | | | |
| 2016-03-15 | 1053.200000 | 729.800000 | 933.800000 | 876.833333 | 12.020833 | 54.883334 |
| 2016-03-16 | 995.250000 | 681.441667 | 1021.750000 | 708.025000 | 9.833333 | 64.069791 |
| 2016-03-17 | 1025.250000 | 806.583333 | 881.375000 | 867.375000 | 11.292708 | 51.107292 |
| 2016-03-18 | 1064.444444 | 794.258333 | 794.527778 | 996.625000 | 12.866319 | 51.530903 |
| 2016-03-19 | 1088.741667 | 755.083333 | 800.883333 | 987.341667 | 16.016667 | 48.843750 |
| ... | ... | ... | ... | ... | ... | ... |
| 2017-02-01 | 729.422222 | 562.650000 | 797.647222 | 553.180556 | 5.267708 | 39.614930 |
| 2017-02-02 | 474.291667 | 347.480556 | 508.180556 | 343.500000 | -55.515972 | -24.010417 |
| 2017-02-03 | 615.700000 | 414.475000 | 819.733333 | 334.458333 | -14.272917 | 28.563542 |
| 2017-02-04 | 691.713889 | 458.947222 | 909.375000 | 379.513889 | 4.848611 | 37.832986 |
| 2017-02-05 | 867.600000 | 751.833333 | 673.741667 | 947.333333 | 7.273958 | 31.809375 |
328 rows × 6 columns
df_inter_model = df_inter.copy().dropna().to_period("d")
df_inter_model
| | CO | HC | NO2 | O3 | T | RH |
|---|---|---|---|---|---|---|
| Date | | | | | | |
| 2016-03-15 | 1053.200000 | 729.800000 | 933.800000 | 876.833333 | 12.020833 | 54.883334 |
| 2016-03-16 | 995.250000 | 681.441667 | 1021.750000 | 708.025000 | 9.833333 | 64.069791 |
| 2016-03-17 | 1025.250000 | 806.583333 | 881.375000 | 867.375000 | 11.292708 | 51.107292 |
| 2016-03-18 | 1064.444444 | 794.258333 | 794.527778 | 996.625000 | 12.866319 | 51.530903 |
| 2016-03-19 | 1088.741667 | 755.083333 | 800.883333 | 987.341667 | 16.016667 | 48.843750 |
| ... | ... | ... | ... | ... | ... | ... |
| 2017-02-01 | 729.422222 | 562.650000 | 797.647222 | 553.180556 | 5.267708 | 39.614930 |
| 2017-02-02 | 474.291667 | 347.480556 | 508.180556 | 343.500000 | -55.515972 | 34.089236 |
| 2017-02-03 | 615.700000 | 414.475000 | 819.733333 | 334.458333 | -14.272917 | 28.563542 |
| 2017-02-04 | 691.713889 | 458.947222 | 909.375000 | 379.513889 | 4.848611 | 37.832986 |
| 2017-02-05 | 867.600000 | 751.833333 | 673.741667 | 947.333333 | 7.273958 | 31.809375 |
328 rows × 6 columns
x, y = df_model[["T", "RH"]], df_model.drop(["T", "RH"], axis=1)
print(x.shape, y.shape)
(328, 2) (328, 4)
train_size = int(len(df_model) * 0.8)
x_train, x_test = x[:train_size], x[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
print(f"x_train.shape = {x_train.shape}, y_train.shape = {y_train.shape}")
print(f"x_test.shape = {x_test.shape}, y_test.shape = {y_test.shape}")
x_train.shape = (262, 2), y_train.shape = (262, 4) x_test.shape = (66, 2), y_test.shape = (66, 4)
x_inter, y_inter = df_inter_model[["T", "RH"]], df_inter_model.drop(["T", "RH"], axis=1)
print(x_inter.shape, y_inter.shape)
(328, 2) (328, 4)
train_size = int(len(df_inter_model) * 0.8)
x_inter_train, x_inter_test = x_inter[:train_size], x_inter[train_size:]
y_inter_train, y_inter_test = y_inter[:train_size], y_inter[train_size:]
print(f"x_train.shape = {x_inter_train.shape}, y_train.shape = {y_inter_train.shape}")
print(f"x_test.shape = {x_inter_test.shape}, y_test.shape = {y_inter_test.shape}")
x_train.shape = (262, 2), y_train.shape = (262, 4) x_test.shape = (66, 2), y_test.shape = (66, 4)
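The 80/20 chronological split above is the simplest option; the TimeSeriesSplit imported earlier generalises it into expanding-window cross-validation, where each fold trains on all data before its test window. A sketch on dummy indices (not the actual modelling pipeline):

```python
# Expanding-window CV: each training window ends before its test window begins.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(328)  # stand-in for the 328 daily observations
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(data)):
    assert train_idx.max() < test_idx.min()  # no leakage from the future
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```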
We will need to reshape the dataset further, as ARIMA and SARIMA can only be trained on one time series at a time; we therefore break the data into four parts, one for each of the four gases.
GAS_List = ["CO", "HC", "NO2", "O3"]
CO_train, CO_test = y_train["CO"], y_test["CO"]
HC_train, HC_test = y_train["HC"], y_test["HC"]
NO2_train, NO2_test = y_train["NO2"], y_test["NO2"]
O3_train, O3_test = y_train["O3"], y_test["O3"]
GAS_train = [CO_train, HC_train, NO2_train, O3_train]
GAS_test = [CO_test, HC_test, NO2_test, O3_test]
CO_inter_train, CO_inter_test = y_inter_train["CO"], y_inter_test["CO"]
HC_inter_train, HC_inter_test = y_inter_train["HC"], y_inter_test["HC"]
NO2_inter_train, NO2_inter_test = y_inter_train["NO2"], y_inter_test["NO2"]
O3_inter_train, O3_inter_test = y_inter_train["O3"], y_inter_test["O3"]
GAS_inter_train = [CO_inter_train, HC_inter_train, NO2_inter_train, O3_inter_train]
GAS_inter_test = [CO_inter_test, HC_inter_test, NO2_inter_test, O3_inter_test]
model_Arr = []
for i in range(len(GAS_train)):
model_Arr.append(ARIMA(endog=GAS_train[i], exog=x_train, order=(1, 1, 1)).fit())
print(model_Arr[i].summary())
model_Arr[i].plot_diagnostics()
plt.show()
SARIMAX Results
==============================================================================
Dep. Variable: CO No. Observations: 262
Model: ARIMA(1, 1, 1) Log Likelihood -1476.857
Date: Thu, 11 Aug 2022 AIC 2963.714
Time: 19:11:54 BIC 2981.536
Sample: 03-15-2016 HQIC 2970.878
- 12-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
T 1.1795 0.584 2.021 0.043 0.036 2.323
RH 3.0268 0.511 5.922 0.000 2.025 4.029
ar.L1 0.6957 0.060 11.607 0.000 0.578 0.813
ma.L1 -0.9727 0.025 -39.233 0.000 -1.021 -0.924
sigma2 4789.3585 371.055 12.907 0.000 4062.104 5516.613
===================================================================================
Ljung-Box (L1) (Q): 4.89 Jarque-Bera (JB): 7.63
Prob(Q): 0.03 Prob(JB): 0.02
Heteroskedasticity (H): 1.32 Skew: -0.25
Prob(H) (two-sided): 0.20 Kurtosis: 3.67
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMAX Results
==============================================================================
Dep. Variable: HC No. Observations: 262
Model: ARIMA(1, 1, 1) Log Likelihood -1534.437
Date: Thu, 11 Aug 2022 AIC 3078.874
Time: 19:11:57 BIC 3096.696
Sample: 03-15-2016 HQIC 3086.038
- 12-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
T 3.0700 0.762 4.028 0.000 1.576 4.564
RH 0.8775 0.670 1.309 0.191 -0.437 2.192
ar.L1 0.5595 0.060 9.388 0.000 0.443 0.676
ma.L1 -0.9634 0.022 -43.311 0.000 -1.007 -0.920
sigma2 7440.1368 653.773 11.380 0.000 6158.766 8721.508
===================================================================================
Ljung-Box (L1) (Q): 4.19 Jarque-Bera (JB): 1.12
Prob(Q): 0.04 Prob(JB): 0.57
Heteroskedasticity (H): 1.73 Skew: -0.14
Prob(H) (two-sided): 0.01 Kurtosis: 3.15
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMAX Results
==============================================================================
Dep. Variable: NO2 No. Observations: 262
Model: ARIMA(1, 1, 1) Log Likelihood -1565.483
Date: Thu, 11 Aug 2022 AIC 3140.965
Time: 19:12:00 BIC 3158.788
Sample: 03-15-2016 HQIC 3148.130
- 12-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
T 3.9984 0.873 4.578 0.000 2.287 5.710
RH -0.2482 0.756 -0.328 0.743 -1.729 1.233
ar.L1 0.5797 0.055 10.451 0.000 0.471 0.688
ma.L1 -0.9589 0.025 -38.017 0.000 -1.008 -0.909
sigma2 9448.0317 314.927 30.001 0.000 8830.787 1.01e+04
===================================================================================
Ljung-Box (L1) (Q): 1.61 Jarque-Bera (JB): 2511.97
Prob(Q): 0.20 Prob(JB): 0.00
Heteroskedasticity (H): 1.71 Skew: 2.52
Prob(H) (two-sided): 0.01 Kurtosis: 17.34
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMAX Results
==============================================================================
Dep. Variable: O3 No. Observations: 262
Model: ARIMA(1, 1, 1) Log Likelihood -1687.731
Date: Thu, 11 Aug 2022 AIC 3385.461
Time: 19:12:02 BIC 3403.284
Sample: 03-15-2016 HQIC 3392.625
- 12-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
T 2.6711 1.264 2.113 0.035 0.194 5.149
RH 1.1397 1.104 1.032 0.302 -1.024 3.304
ar.L1 -0.7332 0.365 -2.008 0.045 -1.449 -0.017
ma.L1 0.7826 0.330 2.373 0.018 0.136 1.429
sigma2 2.422e+04 2153.107 11.248 0.000 2e+04 2.84e+04
===================================================================================
Ljung-Box (L1) (Q): 0.40 Jarque-Bera (JB): 0.01
Prob(Q): 0.53 Prob(JB): 0.99
Heteroskedasticity (H): 1.32 Skew: 0.01
Prob(H) (two-sided): 0.19 Kurtosis: 2.99
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Observations
The P>|z| column reports each term's significance via a hypothesis test with the following hypotheses:
$H_0$: The coefficient is NOT statistically significant (it is zero)
$H_1$: The coefficient is statistically significant (it is non-zero)
The Ljung-Box (L1) (Q) value is the Ljung-Box test statistic at lag 1, which tests for autocorrelation in the residuals:
$H_0$: The errors are white noise (no autocorrelation)
$H_1$: The errors are not white noise
Heteroskedasticity (H) tests whether the error residuals are homoskedastic, i.e. have constant variance:
$H_0$: The residuals have constant variance (homoskedastic)
$H_1$: The residuals do not have constant variance (heteroskedastic)
Jarque-Bera tests the normality of the errors:
$H_0$: The errors are normally distributed
$H_1$: The errors are not normally distributed
The log-likelihood measures how well the fitted model's distribution matches the sampled data.
Akaike's Information Criterion (AIC) helps assess the strength of the model while penalising the number of parameters, since adding parameters always increases the maximum likelihood.
The Bayesian Information Criterion (BIC), like the AIC, penalises a model for complexity, but its penalty also grows with the number of observations in the data.
The Hannan-Quinn Information Criterion (HQIC), like the AIC and BIC, is another criterion for model selection; however, it is not used as often in practice.
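As a concrete check, these criteria can be recomputed by hand from the log-likelihood in the summaries above. A minimal sketch, using the HC ARIMA(1, 1, 1) fit (log likelihood -1534.437, 5 estimated parameters, and, as an assumption that matches the reported numbers, an effective sample of 261 after first differencing):

```python
import numpy as np

# Recompute the information criteria from a reported log-likelihood.
# llf: log-likelihood, k: number of estimated parameters,
# n: effective sample size (assumed 261 here: 262 observations minus 1 for d=1)
def information_criteria(llf, k, n):
    aic = 2 * k - 2 * llf                        # penalises parameter count
    bic = k * np.log(n) - 2 * llf                # penalty also grows with n
    hqic = 2 * k * np.log(np.log(n)) - 2 * llf   # intermediate penalty
    return aic, bic, hqic

# HC ARIMA(1, 1, 1): llf = -1534.437, 5 parameters (T, RH, ar.L1, ma.L1, sigma2)
aic, bic, hqic = information_criteria(-1534.437, 5, 261)
print(round(aic, 3), round(bic, 3), round(hqic, 3))  # ≈ 3078.874, 3096.697, 3086.038
```

The recomputed values agree with the AIC, BIC and HQIC printed in the HC summary up to rounding of the reported log-likelihood.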
Residuals
The difference between the observations and the corresponding fitted values.
$$ e_{t} = y_{t} - \hat{y}_{t} $$
A good forecast will yield the following:
- Residuals that are uncorrelated: any remaining correlation means information was left in the residuals that could have been used.
- Residuals with zero mean: a non-zero mean means the forecasts are biased.
Individual Gases Observations
In conclusion, each model will need to be tuned further using the orders collected from its autocorrelation and partial autocorrelation plots. The residuals of all four models are generally centred on 0, which suggests the fits are broadly correct, but they contain many spikes, meaning the residuals are very staggered. The histograms suggest the residuals may not be normal: they are left-skewed for CO and HC and right-skewed for NO2 and O3. There is also a spike in the correlogram at lag 7, most likely because this time series has a seasonality of 7, which makes lag 7 an important value. This will be remedied with SARIMA, which accounts for seasonality.
CO_model = ARIMA(endog=CO_train, exog=x_train, order=(1, 1, 2)).fit()
HC_model = ARIMA(endog=HC_train, exog=x_train, order=(1, 1, 2)).fit()
NO2_model = ARIMA(endog=NO2_train, exog=x_train, order=(1, 1, 2)).fit()
O3_model = ARIMA(endog=O3_train, exog=x_train, order=(1, 1, 2)).fit()
CO_pred = CO_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
HC_pred = HC_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
NO2_pred = NO2_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
O3_pred = O3_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
GAS_pred = [CO_pred, HC_pred, NO2_pred, O3_pred]
for i in range(len(GAS_pred)):
    fig, ax = plt.subplots(1, 1, figsize=(20, 10))
    GAS_pred[i].plot(ax=ax)
    GAS_test[i].plot(ax=ax)
    plt.title(GAS_List[i])
    plt.legend(
        ["ARIMA Predictions", "Actual Forecast"],
        loc="upper center",
        bbox_to_anchor=(0.5, -0.15),
        fancybox=True,
        shadow=True,
        ncol=3,
    )
    plt.show()
CO_inter_model = ARIMA(endog=CO_inter_train, exog=x_inter_train, order=(1, 1, 2)).fit()
HC_inter_model = ARIMA(endog=HC_inter_train, exog=x_inter_train, order=(1, 1, 2)).fit()
NO2_inter_model = ARIMA(
endog=NO2_inter_train, exog=x_inter_train, order=(1, 1, 2)
).fit()
O3_inter_model = ARIMA(endog=O3_inter_train, exog=x_inter_train, order=(1, 1, 2)).fit()
CO_inter_pred = CO_inter_model.predict(
start=x_inter_test.index[0], end=x_inter_test.index[-1], exog=x_inter_test
)
HC_inter_pred = HC_inter_model.predict(
start=x_inter_test.index[0], end=x_inter_test.index[-1], exog=x_inter_test
)
NO2_inter_pred = NO2_inter_model.predict(
start=x_inter_test.index[0], end=x_inter_test.index[-1], exog=x_inter_test
)
O3_inter_pred = O3_inter_model.predict(
start=x_inter_test.index[0], end=x_inter_test.index[-1], exog=x_inter_test
)
GAS_inter_pred = [CO_inter_pred, HC_inter_pred, NO2_inter_pred, O3_inter_pred]
for i in range(len(GAS_inter_pred)):
    fig, ax = plt.subplots(1, 1, figsize=(20, 10))
    GAS_inter_pred[i].plot(ax=ax)
    GAS_inter_test[i].plot(ax=ax)
    plt.title(GAS_List[i])
    plt.legend(
        ["ARIMA Predictions", "Actual Forecast"],
        loc="upper center",
        bbox_to_anchor=(0.5, -0.15),
        fancybox=True,
        shadow=True,
        ncol=3,
    )
    plt.show()
As we can see from the predicted values and graphs, the interpolated dataset did not perform as well as the original dataset. Furthermore, future readings may also be missing when sensors fail, and we may need to predict through such gaps so that we know when the sensors could potentially fail and can fix them in case of emergencies. Because of this, we will use the non-interpolated (original) data for the rest of the time series predictions.
SARIMA will be used to compare against the baseline ARIMA in terms of performance. It should perform better because SARIMA additionally models the seasonal component, at the cost of more computation. Based on the seasonal decomposition, we know that all variables have a seasonality of 7.
model_Arr = []
for i in range(len(GAS_train)):
    model_Arr.append(
        SARIMAX(
            endog=GAS_train[i],
            exog=x_train,
            order=(1, 1, 1),
            seasonal_order=(2, 0, 0, 7),
        ).fit()
    )
    print(model_Arr[i].summary())
    model_Arr[i].plot_diagnostics()
    plt.show()
SARIMAX Results
==========================================================================================
Dep. Variable: CO No. Observations: 262
Model: SARIMAX(1, 1, 1)x(2, 0, [], 7) Log Likelihood -1468.197
Date: Thu, 11 Aug 2022 AIC 2950.394
Time: 19:12:24 BIC 2975.346
Sample: 03-15-2016 HQIC 2960.424
- 12-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
T 1.2716 0.544 2.336 0.019 0.205 2.338
RH 2.9454 0.481 6.124 0.000 2.003 3.888
ar.L1 0.7330 0.047 15.505 0.000 0.640 0.826
ma.L1 -0.9991 0.101 -9.870 0.000 -1.198 -0.801
ar.S.L7 0.2172 0.060 3.639 0.000 0.100 0.334
ar.S.L14 0.1335 0.068 1.976 0.048 0.001 0.266
sigma2 4446.0492 514.360 8.644 0.000 3437.922 5454.176
===================================================================================
Ljung-Box (L1) (Q): 2.69 Jarque-Bera (JB): 5.30
Prob(Q): 0.10 Prob(JB): 0.07
Heteroskedasticity (H): 1.29 Skew: -0.12
Prob(H) (two-sided): 0.23 Kurtosis: 3.66
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMAX Results
==========================================================================================
Dep. Variable: HC No. Observations: 262
Model: SARIMAX(1, 1, 1)x(2, 0, [], 7) Log Likelihood -1489.902
Date: Thu, 11 Aug 2022 AIC 2993.804
Time: 19:12:28 BIC 3018.756
Sample: 03-15-2016 HQIC 3003.834
- 12-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
T 3.5025 0.609 5.752 0.000 2.309 4.696
RH 0.4632 0.538 0.861 0.389 -0.591 1.518
ar.L1 0.6368 0.052 12.220 0.000 0.535 0.739
ma.L1 -0.9996 0.291 -3.438 0.001 -1.569 -0.430
ar.S.L7 0.3968 0.058 6.790 0.000 0.282 0.511
ar.S.L14 0.2517 0.063 3.971 0.000 0.127 0.376
sigma2 5212.3591 1465.851 3.556 0.000 2339.344 8085.374
===================================================================================
Ljung-Box (L1) (Q): 0.30 Jarque-Bera (JB): 2.28
Prob(Q): 0.59 Prob(JB): 0.32
Heteroskedasticity (H): 1.70 Skew: 0.14
Prob(H) (two-sided): 0.01 Kurtosis: 3.37
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMAX Results
==========================================================================================
Dep. Variable: NO2 No. Observations: 262
Model: SARIMAX(1, 1, 1)x(2, 0, [], 7) Log Likelihood -1552.128
Date: Thu, 11 Aug 2022 AIC 3118.256
Time: 19:12:31 BIC 3143.207
Sample: 03-15-2016 HQIC 3128.285
- 12-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
T 3.9018 0.823 4.740 0.000 2.288 5.515
RH -0.1408 0.722 -0.195 0.845 -1.556 1.274
ar.L1 0.6137 0.049 12.652 0.000 0.519 0.709
ma.L1 -0.9771 0.021 -46.043 0.000 -1.019 -0.935
ar.S.L7 0.2300 0.048 4.756 0.000 0.135 0.325
ar.S.L14 0.1949 0.074 2.632 0.008 0.050 0.340
sigma2 8503.6958 417.942 20.347 0.000 7684.545 9322.846
===================================================================================
Ljung-Box (L1) (Q): 0.32 Jarque-Bera (JB): 1549.65
Prob(Q): 0.57 Prob(JB): 0.00
Heteroskedasticity (H): 1.75 Skew: 2.10
Prob(H) (two-sided): 0.01 Kurtosis: 14.18
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMAX Results
==========================================================================================
Dep. Variable: O3 No. Observations: 262
Model: SARIMAX(1, 1, 1)x(2, 0, [], 7) Log Likelihood -1646.827
Date: Thu, 11 Aug 2022 AIC 3307.653
Time: 19:12:36 BIC 3332.605
Sample: 03-15-2016 HQIC 3317.683
- 12-01-2016
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
T 2.3490 1.145 2.052 0.040 0.106 4.592
RH 1.4634 1.029 1.423 0.155 -0.553 3.479
ar.L1 0.6719 0.055 12.317 0.000 0.565 0.779
ma.L1 -0.9998 0.566 -1.766 0.077 -2.109 0.110
ar.S.L7 0.2878 0.064 4.528 0.000 0.163 0.412
ar.S.L14 0.1999 0.070 2.853 0.004 0.063 0.337
sigma2 1.742e+04 9288.476 1.876 0.061 -782.397 3.56e+04
===================================================================================
Ljung-Box (L1) (Q): 1.67 Jarque-Bera (JB): 1.79
Prob(Q): 0.20 Prob(JB): 0.41
Heteroskedasticity (H): 1.35 Skew: 0.14
Prob(H) (two-sided): 0.16 Kurtosis: 3.29
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Observations
In conclusion, each model will need to be tuned further using the orders collected from its autocorrelation and partial autocorrelation plots. Compared to ARIMA, SARIMA performed better: its residual plot is closer to 0 with slightly less variation, and in the correlogram no lag falls outside the 5% significance band, which is good.
CO_model = SARIMAX(
endog=CO_train, exog=x_train, order=(1, 1, 2), seasonal_order=(2, 0, 0, 7)
).fit()
HC_model = SARIMAX(
endog=HC_train, exog=x_train, order=(1, 1, 2), seasonal_order=(2, 0, 0, 7)
).fit()
NO2_model = SARIMAX(
endog=NO2_train, exog=x_train, order=(1, 1, 2), seasonal_order=(2, 0, 0, 7)
).fit()
O3_model = SARIMAX(
endog=O3_train, exog=x_train, order=(1, 1, 2), seasonal_order=(2, 0, 0, 7)
).fit()
CO_pred = CO_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
HC_pred = HC_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
NO2_pred = NO2_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
O3_pred = O3_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
GAS_pred = [CO_pred, HC_pred, NO2_pred, O3_pred]
for i in range(len(GAS_pred)):
    fig, ax = plt.subplots(1, 1, figsize=(20, 10))
    GAS_pred[i].plot(ax=ax)
    GAS_test[i].plot(ax=ax)
    plt.title(GAS_List[i])
    plt.legend(
        ["SARIMA Predictions", "Actual Forecast"],
        loc="upper center",
        bbox_to_anchor=(0.5, -0.15),
        fancybox=True,
        shadow=True,
        ncol=3,
    )
    plt.show()
When comparing time series models, we look at the AIC and BIC: the lower they are, the better the model's performance. As mentioned, the AIC emphasises predictive performance and therefore tends to select more complex models, while the BIC, whose penalty also incorporates the number of observations, punishes complexity more strictly and tends to select simpler models that stay closer to the training data. This means the AIC and BIC need to be balanced against each other.
Based on the SARIMA and ARIMA summary, the SARIMA models have a decrease in AIC and BIC score by around 50 - 300 points. This means that with the inclusion of the seasonality, the model performs better.
Furthermore, based on the side-by-side comparison of the model predictions on the validation set, SARIMA follows the trend of the actual forecast more closely than ARIMA.
As time series forecasting is essentially a regression problem, we will use regression scoring metrics, MAPE and RMSE, to evaluate the models.
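A minimal, illustrative sketch of MAPE and RMSE on toy values (the numbers are made up, and RMSE is computed via the square root to stay agnostic to the sklearn version):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

y_true = np.array([1000.0, 1050.0, 980.0])  # toy "actual" gas readings
y_pred = np.array([990.0, 1070.0, 1000.0])  # toy forecasts

# MAPE: mean of |y - yhat| / |y|; scale-independent, so comparable across gases
mape = mean_absolute_percentage_error(y_true, y_pred)

# RMSE: square root of the mean squared error; same units as the target,
# and penalises large errors more heavily than small ones
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"{mape * 100:.3f}% {rmse:.3f}")  # → 1.649% 17.321
```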
CO_train_pred = CO_model.predict(
start=x_train.index[0], end=x_train.index[-1], exog=x_train
)
HC_train_pred = HC_model.predict(
start=x_train.index[0], end=x_train.index[-1], exog=x_train
)
NO2_train_pred = NO2_model.predict(
start=x_train.index[0], end=x_train.index[-1], exog=x_train
)
O3_train_pred = O3_model.predict(
start=x_train.index[0], end=x_train.index[-1], exog=x_train
)
GAS_train_pred = [CO_train_pred, HC_train_pred, NO2_train_pred, O3_train_pred]
for i in range(len(GAS_List)):
    print(
        f"{GAS_List[i]} Train MAPE:\t\t{mean_absolute_percentage_error(y_train[GAS_List[i]], GAS_train_pred[i]) * 100:.3f}%"
    )
    print(
        f"{GAS_List[i]} Train RMSE:\t\t{mean_squared_error(y_train[GAS_List[i]], GAS_train_pred[i], squared=False):.3f}"
    )
    print(
        f"{GAS_List[i]} Validation MAPE:\t{mean_absolute_percentage_error(y_test[GAS_List[i]], GAS_pred[i]) * 100:.3f}%"
    )
    print(
        f"{GAS_List[i]} Validation RMSE:\t{mean_squared_error(y_test[GAS_List[i]], GAS_pred[i], squared=False):.3f}"
    )
    print()
CO Train MAPE:          6.928%
CO Train RMSE:          85.607
CO Validation MAPE:     15.369%
CO Validation RMSE:     109.315

HC Train MAPE:          9.431%
HC Train RMSE:          83.326
HC Validation MAPE:     981.828%
HC Validation RMSE:     131.328

NO2 Train MAPE:         11.050%
NO2 Train RMSE:         107.408
NO2 Validation MAPE:    21.158%
NO2 Validation RMSE:    184.255

O3 Train MAPE:          38.968%
O3 Train RMSE:          140.103
O3 Validation MAPE:     60.418%
O3 Validation RMSE:     267.255
Observations
To validate the model and gauge how well it generalises, one approach is to train and evaluate it over different windows through the passage of time.
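`TimeSeriesSplit` from sklearn implements exactly this: each fold trains on an expanding window of earlier observations and validates on the block that follows, so the model is never evaluated on data from before its training window. A minimal sketch on ten dummy observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten chronological observations

# Each split trains on all earlier data and validates on the next block
splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
# → [0 1 2 3] [4 5]
#   [0 1 2 3 4 5] [6 7]
#   [0 1 2 3 4 5 6 7] [8 9]
```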
def model_cv(
    model, endog, exog, order=(1, 1, 1), splits=3, seasonal_order=None, max_iter=1000
):
    valid_rmse = []
    valid_mape = []
    train_rmse = []
    train_mape = []
    model_aic = []
    model_bic = []
    time_series = TimeSeriesSplit(n_splits=splits)
    for train_index, test_index in time_series.split(exog):
        X_train, X_test = exog.iloc[train_index], exog.iloc[test_index]
        y_train, y_test = endog.iloc[train_index], endog.iloc[test_index]
        if seasonal_order is None:
            model_fit = model(endog=y_train, exog=X_train, order=order).fit(
                maxiter=max_iter, disp=False
            )
        else:
            # pass order as well, so the non-seasonal (p, d, q) is not
            # silently reset to the model's default when a seasonal order
            # is supplied
            model_fit = model(
                endog=y_train, exog=X_train, order=order, seasonal_order=seasonal_order
            ).fit(maxiter=max_iter, disp=False)
        y_pred = model_fit.predict(
            start=X_test.index[0], end=X_test.index[-1], exog=X_test
        )
        valid_rmse.append(mean_squared_error(y_test, y_pred, squared=False))
        valid_mape.append(mean_absolute_percentage_error(y_test, y_pred))
        train_pred = model_fit.predict(
            start=X_train.index[0], end=X_train.index[-1], exog=X_train
        )
        train_rmse.append(mean_squared_error(y_train, train_pred, squared=False))
        train_mape.append(mean_absolute_percentage_error(y_train, train_pred))
        model_aic.append(model_fit.aic)
        model_bic.append(model_fit.bic)
    return (
        # fold-averaged scores
        pd.Series(
            {
                "train_rmse": np.mean(train_rmse),
                "train_rmse_std": np.std(train_rmse),
                "valid_rmse": np.mean(valid_rmse),
                "valid_rmse_std": np.std(valid_rmse),
                "train_mape": np.mean(train_mape),
                "valid_mape": np.mean(valid_mape),
                "AIC": np.mean(model_aic),
                "BIC": np.mean(model_bic),
            },
            name=order,
        ),
        # per-fold scores (the *_std columns simply repeat the per-fold
        # values, since a single fold has no spread)
        pd.DataFrame(
            {
                "train_rmse": train_rmse,
                "train_rmse_std": train_rmse,
                "valid_rmse": valid_rmse,
                "valid_rmse_std": valid_rmse,
                "train_mape": train_mape,
                "valid_mape": valid_mape,
                "AIC": model_aic,
                "BIC": model_bic,
            }
        ),
    )
avg_scores = []
cv_table = []
for i in GAS_List:
    score, cv = model_cv(
        SARIMAX, exog=x, endog=y[i], splits=5, seasonal_order=(1, 1, 2, 7)
    )
    avg_scores.append(score)
    cv_table.append(cv)
for i in range(len(cv_table)):
    print(GAS_List[i])
    display(
        cv_table[i]
        .style.apply(
            lambda x: [
                "background-color: green; color: white" if v else ""
                for v in x == x.min()
            ]
        )
        .apply(
            lambda x: [
                "background-color: red; color: white" if v else "" for v in x == x.max()
            ]
        )
    )
CO
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| 0 | 357.446249 | 357.446249 | 131.904478 | 131.904478 | 0.173741 | 0.162438 | 600.783578 | 614.306357 |
| 1 | 241.114709 | 241.114709 | 95.210348 | 95.210348 | 0.123807 | 0.105678 | 1196.549122 | 1215.126845 |
| 2 | 200.777546 | 200.777546 | 104.940262 | 104.940262 | 0.095747 | 0.101699 | 1758.902594 | 1780.384923 |
| 3 | 172.891407 | 172.891407 | 114.542178 | 114.542178 | 0.088486 | 0.096843 | 2351.712728 | 2375.241773 |
| 4 | 155.016154 | 155.016154 | 107.527807 | 107.527807 | 0.085085 | 0.162321 | 2996.071669 | 3021.182410 |
HC
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| 0 | 289.558361 | 289.558361 | 107.721196 | 107.721196 | 0.190471 | 0.144765 | 584.730847 | 598.253627 |
| 1 | 198.726073 | 198.726073 | 108.945717 | 108.945717 | 0.139233 | 0.129224 | 1186.405427 | 1204.983150 |
| 2 | 168.380172 | 168.380172 | 132.108710 | 132.108710 | 0.113728 | 0.174578 | 1764.535567 | 1786.017896 |
| 3 | 144.252210 | 144.252210 | 149.995831 | 149.995831 | 0.106327 | 0.144264 | 2370.064258 | 2393.593303 |
| 4 | 130.621383 | 130.621383 | 134.445258 | 134.445258 | 0.101664 | 9.756214 | 3025.633116 | 3050.743857 |
NO2
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| 0 | 242.228626 | 242.228626 | 130.614890 | 130.614890 | 0.173871 | 0.157243 | 610.495193 | 624.017973 |
| 1 | 192.952993 | 192.952993 | 105.891130 | 105.891130 | 0.142404 | 0.148053 | 1250.576662 | 1269.154385 |
| 2 | 161.828087 | 161.828087 | 128.328467 | 128.328467 | 0.115961 | 0.231518 | 1840.769978 | 1862.252307 |
| 3 | 153.345921 | 153.345921 | 182.235857 | 182.235857 | 0.124040 | 0.211636 | 2470.771801 | 2494.300846 |
| 4 | 156.092875 | 156.092875 | 174.958813 | 174.958813 | 0.128878 | 0.233500 | 3210.808485 | 3235.919225 |
O3
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| 0 | 384.880150 | 384.880150 | 222.632837 | 222.632837 | 0.286188 | 0.309980 | 663.157199 | 676.679978 |
| 1 | 257.954444 | 257.954444 | 177.178685 | 177.178685 | 0.218261 | 0.238075 | 1324.891025 | 1343.468747 |
| 2 | 223.504782 | 223.504782 | 208.895546 | 208.895546 | 0.188406 | 1.351125 | 1979.042694 | 2000.525023 |
| 3 | 193.703227 | 193.703227 | 283.748969 | 283.748969 | 0.270236 | 0.233471 | 2641.800826 | 2665.329872 |
| 4 | 179.775801 | 179.775801 | 268.166430 | 268.166430 | 0.243837 | 0.558143 | 3353.105245 | 3378.215986 |
To enhance the model, we will need to find the best order. Although we can read off candidate values of p and q from the autocorrelation and partial autocorrelation plots, we can also loop through candidate values (here, the seasonal P and Q at period 7) to find the best combination. We are making use of a function that works much like GridSearchCV from sklearn.
warnings.filterwarnings("ignore")
scores_df = pd.DataFrame()
p = np.arange(0, 5)
q = np.arange(0, 5)
for i in list(product(p, q)):
    gas_df = pd.DataFrame()
    i = (i[0], 1, i[1], 7)
    print("Computing (p, 1, q, 7):", i)
    try:
        for j in range(len(GAS_List)):
            scores_df[i, GAS_List[j]], _ = model_cv(
                SARIMAX, seasonal_order=i, splits=5, exog=x, endog=y[GAS_List[j]]
            )
    except Exception as e:
        print("Error computing (p, 1, q, 7):", i)
        print(e)
Computing (p, 1, q, 7): (0, 1, 0, 7)
Computing (p, 1, q, 7): (0, 1, 1, 7)
Computing (p, 1, q, 7): (0, 1, 2, 7)
Computing (p, 1, q, 7): (0, 1, 3, 7)
Computing (p, 1, q, 7): (0, 1, 4, 7)
Computing (p, 1, q, 7): (1, 1, 0, 7)
Computing (p, 1, q, 7): (1, 1, 1, 7)
Computing (p, 1, q, 7): (1, 1, 2, 7)
Computing (p, 1, q, 7): (1, 1, 3, 7)
Computing (p, 1, q, 7): (1, 1, 4, 7)
Computing (p, 1, q, 7): (2, 1, 0, 7)
Computing (p, 1, q, 7): (2, 1, 1, 7)
Computing (p, 1, q, 7): (2, 1, 2, 7)
Computing (p, 1, q, 7): (2, 1, 3, 7)
Computing (p, 1, q, 7): (2, 1, 4, 7)
Computing (p, 1, q, 7): (3, 1, 0, 7)
Computing (p, 1, q, 7): (3, 1, 1, 7)
Computing (p, 1, q, 7): (3, 1, 2, 7)
Computing (p, 1, q, 7): (3, 1, 3, 7)
Computing (p, 1, q, 7): (3, 1, 4, 7)
Computing (p, 1, q, 7): (4, 1, 0, 7)
Computing (p, 1, q, 7): (4, 1, 1, 7)
Computing (p, 1, q, 7): (4, 1, 2, 7)
Computing (p, 1, q, 7): (4, 1, 3, 7)
Computing (p, 1, q, 7): (4, 1, 4, 7)
cv_df = scores_df.T.reset_index()
CO_cv = []
HC_cv = []
NO2_cv = []
O3_cv = []
for i in range(len(cv_df)):
    gas = str(cv_df.loc[:, "index"][i]).split("'")[1]
    if gas == "CO":
        CO_cv.append(cv_df.iloc[i])
    elif gas == "HC":
        HC_cv.append(cv_df.iloc[i])
    elif gas == "NO2":
        NO2_cv.append(cv_df.iloc[i])
    elif gas == "O3":
        O3_cv.append(cv_df.iloc[i])
CO_cv = pd.DataFrame(CO_cv).set_index("index")
HC_cv = pd.DataFrame(HC_cv).set_index("index")
NO2_cv = pd.DataFrame(NO2_cv).set_index("index")
O3_cv = pd.DataFrame(O3_cv).set_index("index")
CO_cv.style.apply(
lambda x: [
"background-color: green; color: white" if v else "" for v in x == x.min()
]
).apply(
lambda x: ["background-color: red; color: white" if v else "" for v in x == x.max()]
)
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| index | ||||||||
| ((0, 1, 0, 7), 'CO') | 228.131572 | 63.315999 | 128.421792 | 38.812400 | 0.132707 | 0.147949 | 1857.444929 | 1869.127514 |
| ((0, 1, 1, 7), 'CO') | 223.671195 | 69.073010 | 109.522914 | 10.523281 | 0.113708 | 0.124360 | 1778.129621 | 1792.732852 |
| ((0, 1, 2, 7), 'CO') | 225.428048 | 72.181050 | 110.864402 | 11.206556 | 0.113412 | 0.128177 | 1779.210109 | 1796.733986 |
| ((0, 1, 3, 7), 'CO') | 224.907143 | 72.157094 | 109.443563 | 15.430673 | 0.113135 | 0.127078 | 1779.978283 | 1800.422807 |
| ((0, 1, 4, 7), 'CO') | 224.496172 | 72.036565 | 107.983794 | 22.782132 | 0.113777 | 0.123609 | 1780.862547 | 1804.227717 |
| ((1, 1, 0, 7), 'CO') | 231.160574 | 68.655002 | 123.811704 | 27.363136 | 0.123708 | 0.152981 | 1817.421143 | 1832.024374 |
| ((1, 1, 1, 7), 'CO') | 225.105872 | 71.584493 | 111.002662 | 11.712933 | 0.113366 | 0.127823 | 1779.289205 | 1796.813083 |
| ((1, 1, 2, 7), 'CO') | 225.449213 | 72.111657 | 110.825014 | 12.226862 | 0.113373 | 0.125796 | 1780.803938 | 1801.248462 |
| ((1, 1, 3, 7), 'CO') | 224.636243 | 71.819237 | 118.665734 | 19.108347 | 0.113640 | 0.136296 | 1781.197262 | 1804.562431 |
| ((1, 1, 4, 7), 'CO') | 223.230865 | 71.996755 | 113.465984 | 28.141337 | 0.113373 | 0.127991 | 1780.580508 | 1806.866324 |
| ((2, 1, 0, 7), 'CO') | 228.962574 | 70.566803 | 117.282038 | 26.481222 | 0.118642 | 0.151812 | 1802.055552 | 1819.579429 |
| ((2, 1, 1, 7), 'CO') | 224.780722 | 72.098853 | 109.191438 | 15.071043 | 0.113067 | 0.126536 | 1779.993805 | 1800.438329 |
| ((2, 1, 2, 7), 'CO') | 224.914515 | 72.371431 | 112.017902 | 11.704601 | 0.113713 | 0.128589 | 1781.211856 | 1804.577026 |
| ((2, 1, 3, 7), 'CO') | 223.151910 | 71.030932 | 121.883034 | 18.578023 | 0.111996 | 0.142105 | 1779.021224 | 1805.307040 |
| ((2, 1, 4, 7), 'CO') | 221.910468 | 72.464594 | 115.711611 | 21.917703 | 0.110207 | 0.121326 | 1776.322500 | 1805.528962 |
| ((3, 1, 0, 7), 'CO') | 227.906999 | 70.936460 | 112.515962 | 27.009884 | 0.116036 | 0.136435 | 1786.944420 | 1807.388944 |
| ((3, 1, 1, 7), 'CO') | 225.171323 | 72.421817 | 109.212215 | 17.884477 | 0.113389 | 0.124179 | 1781.293591 | 1804.658761 |
| ((3, 1, 2, 7), 'CO') | 223.483579 | 70.314740 | 104.487037 | 16.154343 | 0.113270 | 0.117586 | 1781.323106 | 1807.608922 |
| ((3, 1, 3, 7), 'CO') | 221.639246 | 71.873940 | 103.973338 | 13.892593 | 0.110058 | 0.107357 | 1775.885039 | 1805.091501 |
| ((3, 1, 4, 7), 'CO') | 220.606223 | 70.421718 | 115.806231 | 31.390047 | 0.111397 | 0.120749 | 1781.940525 | 1814.067633 |
| ((4, 1, 0, 7), 'CO') | 227.898600 | 70.617940 | 112.832028 | 26.522744 | 0.116146 | 0.133622 | 1788.046868 | 1811.412037 |
| ((4, 1, 1, 7), 'CO') | 225.726937 | 71.960333 | 105.128289 | 21.917314 | 0.114462 | 0.118740 | 1782.934850 | 1809.220666 |
| ((4, 1, 2, 7), 'CO') | 223.739077 | 71.370409 | 108.552221 | 8.846064 | 0.112847 | 0.123601 | 1782.692363 | 1811.898825 |
| ((4, 1, 3, 7), 'CO') | 221.196933 | 70.949024 | 108.350520 | 7.054516 | 0.110991 | 0.121021 | 1779.437155 | 1811.564263 |
| ((4, 1, 4, 7), 'CO') | 221.669877 | 71.624652 | 103.685466 | 13.898846 | 0.110258 | 0.112606 | 1781.790517 | 1816.838271 |
HC_cv.style.apply(
lambda x: [
"background-color: green; color: white" if v else "" for v in x == x.min()
]
).apply(
lambda x: ["background-color: red; color: white" if v else "" for v in x == x.max()]
)
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| index | ||||||||
| ((0, 1, 0, 7), 'HC') | 187.888210 | 46.713044 | 138.813859 | 49.711782 | 0.153325 | 2.709477 | 1852.095435 | 1863.778020 |
| ((0, 1, 1, 7), 'HC') | 185.324718 | 54.741757 | 123.698528 | 18.395170 | 0.130318 | 2.235863 | 1783.799957 | 1798.403188 |
| ((0, 1, 2, 7), 'HC') | 186.056248 | 56.659403 | 125.877306 | 15.959083 | 0.130901 | 2.060544 | 1784.980224 | 1802.504101 |
| ((0, 1, 3, 7), 'HC') | 185.183160 | 54.979182 | 124.728737 | 17.981489 | 0.129856 | 2.089617 | 1785.934340 | 1806.378864 |
| ((0, 1, 4, 7), 'HC') | 185.287201 | 54.940296 | 126.090411 | 18.556894 | 0.129722 | 2.062415 | 1787.850562 | 1811.215732 |
| ((1, 1, 0, 7), 'HC') | 190.455731 | 52.711802 | 139.370612 | 40.324136 | 0.143275 | 3.918523 | 1815.085768 | 1829.688999 |
| ((1, 1, 1, 7), 'HC') | 186.197279 | 56.739316 | 126.284514 | 15.727005 | 0.130708 | 2.069653 | 1784.561847 | 1802.085725 |
| ((1, 1, 2, 7), 'HC') | 186.307640 | 56.579045 | 126.643342 | 16.169925 | 0.130285 | 2.069809 | 1786.273843 | 1806.718367 |
| ((1, 1, 3, 7), 'HC') | 185.340985 | 54.953232 | 124.884819 | 18.084280 | 0.129739 | 2.059509 | 1787.559950 | 1810.925119 |
| ((1, 1, 4, 7), 'HC') | 185.220512 | 55.138635 | 125.561637 | 18.848237 | 0.129670 | 2.176157 | 1789.236290 | 1815.522106 |
| ((2, 1, 0, 7), 'HC') | 189.636744 | 54.767212 | 134.290362 | 34.995309 | 0.138566 | 4.319168 | 1803.574336 | 1821.098213 |
| ((2, 1, 1, 7), 'HC') | 185.510988 | 55.494084 | 125.284543 | 17.352300 | 0.129907 | 2.086320 | 1786.190034 | 1806.634557 |
| ((2, 1, 2, 7), 'HC') | 185.706144 | 55.524760 | 125.248931 | 17.247742 | 0.129896 | 2.028909 | 1787.811608 | 1811.176778 |
| ((2, 1, 3, 7), 'HC') | 184.846129 | 55.060836 | 128.948892 | 23.857246 | 0.128604 | 1.525290 | 1786.162814 | 1812.448630 |
| ((2, 1, 4, 7), 'HC') | 185.340003 | 55.510409 | 133.100166 | 26.413087 | 0.129216 | 1.599057 | 1789.257878 | 1818.464340 |
| ((3, 1, 0, 7), 'HC') | 189.123614 | 55.303619 | 131.605433 | 32.769075 | 0.135731 | 3.125555 | 1797.644559 | 1818.089082 |
| ((3, 1, 1, 7), 'HC') | 186.281531 | 56.799573 | 126.372665 | 16.003861 | 0.130759 | 2.083197 | 1788.455160 | 1811.820330 |
| ((3, 1, 2, 7), 'HC') | 186.069699 | 56.407474 | 125.910128 | 17.210104 | 0.129969 | 2.046916 | 1789.634288 | 1815.920104 |
| ((3, 1, 3, 7), 'HC') | 184.001937 | 53.249239 | 132.603292 | 28.699781 | 0.128587 | 2.412150 | 1787.685834 | 1816.892296 |
| ((3, 1, 4, 7), 'HC') | 184.356274 | 55.206366 | 129.499646 | 24.328935 | 0.128007 | 1.687056 | 1788.736591 | 1820.863700 |
| ((4, 1, 0, 7), 'HC') | 188.941519 | 55.122143 | 130.381511 | 31.840348 | 0.133918 | 2.660089 | 1798.128086 | 1821.493256 |
| ((4, 1, 1, 7), 'HC') | 184.871144 | 54.132914 | 122.806123 | 18.285330 | 0.129472 | 1.925421 | 1789.461946 | 1815.747761 |
| ((4, 1, 2, 7), 'HC') | 184.423009 | 54.078515 | 129.228202 | 24.362280 | 0.128899 | 2.282933 | 1789.096918 | 1818.303380 |
| ((4, 1, 3, 7), 'HC') | 184.767112 | 55.024619 | 130.273414 | 25.539292 | 0.128888 | 2.166164 | 1788.778815 | 1820.905923 |
| ((4, 1, 4, 7), 'HC') | 184.563713 | 54.016950 | 125.686472 | 20.494043 | 0.128869 | 1.634376 | 1793.296957 | 1828.344711 |
NO2_cv.style.apply(
lambda x: [
"background-color: green; color: white" if v else "" for v in x == x.min()
]
).apply(
lambda x: ["background-color: red; color: white" if v else "" for v in x == x.max()]
)
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| index | ||||||||
| ((0, 1, 0, 7), 'NO2') | 205.612829 | 42.955073 | 154.028498 | 73.042990 | 0.165474 | 0.236603 | 1955.599939 | 1967.282524 |
| ((0, 1, 1, 7), 'NO2') | 185.120992 | 39.187714 | 136.994253 | 32.810299 | 0.140655 | 0.193583 | 1877.618445 | 1892.221676 |
| ((0, 1, 2, 7), 'NO2') | 182.261484 | 34.906788 | 142.641262 | 28.385614 | 0.137306 | 0.199030 | 1876.671514 | 1894.195391 |
| ((0, 1, 3, 7), 'NO2') | 181.429633 | 33.438616 | 140.233218 | 30.012088 | 0.136167 | 0.192675 | 1877.215374 | 1897.659898 |
| ((0, 1, 4, 7), 'NO2') | 181.615254 | 33.419512 | 140.101536 | 31.699111 | 0.136222 | 0.191649 | 1878.901266 | 1902.266436 |
| ((1, 1, 0, 7), 'NO2') | 188.623185 | 33.378022 | 141.819786 | 50.812222 | 0.147337 | 0.203625 | 1908.169127 | 1922.772358 |
| ((1, 1, 1, 7), 'NO2') | 181.830952 | 34.222318 | 140.180030 | 29.416779 | 0.137239 | 0.196094 | 1876.320593 | 1893.844470 |
| ((1, 1, 2, 7), 'NO2') | 181.289700 | 33.597840 | 144.405832 | 29.314329 | 0.137031 | 0.196390 | 1876.684424 | 1897.128947 |
| ((1, 1, 3, 7), 'NO2') | 181.991359 | 34.656093 | 139.033517 | 34.081733 | 0.137123 | 0.190073 | 1878.102221 | 1901.467391 |
| ((1, 1, 4, 7), 'NO2') | 181.645464 | 33.784329 | 141.556942 | 30.253418 | 0.136392 | 0.193653 | 1880.525479 | 1906.811294 |
| ((2, 1, 0, 7), 'NO2') | 185.576447 | 32.062612 | 135.371655 | 44.470979 | 0.142648 | 0.189829 | 1893.630767 | 1911.154644 |
| ((2, 1, 1, 7), 'NO2') | 182.691318 | 35.861532 | 139.227217 | 30.467083 | 0.137497 | 0.193126 | 1877.959148 | 1898.403672 |
| ((2, 1, 2, 7), 'NO2') | 181.337611 | 33.472576 | 143.886281 | 30.389878 | 0.136868 | 0.196378 | 1878.012209 | 1901.377379 |
| ((2, 1, 3, 7), 'NO2') | 182.206026 | 34.090125 | 134.539893 | 34.686731 | 0.136997 | 0.185557 | 1879.383724 | 1905.669539 |
| ((2, 1, 4, 7), 'NO2') | 181.430601 | 33.485862 | 139.333331 | 32.665622 | 0.136507 | 0.188924 | 1882.002386 | 1911.208848 |
| ((3, 1, 0, 7), 'NO2') | 184.027092 | 32.689725 | 134.706566 | 39.963405 | 0.139553 | 0.182798 | 1886.189246 | 1906.633770 |
| ((3, 1, 1, 7), 'NO2') | 181.299734 | 33.222922 | 142.603462 | 29.313602 | 0.136504 | 0.195919 | 1879.192377 | 1902.557547 |
| ((3, 1, 2, 7), 'NO2') | 181.332374 | 33.511148 | 142.422472 | 29.137486 | 0.136838 | 0.195597 | 1881.044732 | 1907.330548 |
| ((3, 1, 3, 7), 'NO2') | 181.775335 | 33.649490 | 142.982974 | 31.388092 | 0.137447 | 0.198200 | 1882.507295 | 1911.713757 |
| ((3, 1, 4, 7), 'NO2') | 182.482631 | 35.005413 | 138.845505 | 36.965086 | 0.136008 | 0.185298 | 1881.659854 | 1913.786962 |
| ((4, 1, 0, 7), 'NO2') | 183.472962 | 32.701517 | 137.640105 | 40.296029 | 0.138995 | 0.187673 | 1886.252263 | 1909.617433 |
| ((4, 1, 1, 7), 'NO2') | 181.277959 | 33.207240 | 142.393518 | 25.719016 | 0.136818 | 0.196457 | 1881.753010 | 1908.038826 |
| ((4, 1, 2, 7), 'NO2') | 181.519381 | 34.014348 | 141.837902 | 29.261124 | 0.136382 | 0.194368 | 1882.475287 | 1911.681749 |
| ((4, 1, 3, 7), 'NO2') | 181.954211 | 33.784323 | 148.569741 | 27.653710 | 0.135951 | 0.204590 | 1881.861844 | 1913.988952 |
| ((4, 1, 4, 7), 'NO2') | 182.199968 | 34.246421 | 140.950970 | 32.326849 | 0.135327 | 0.189123 | 1882.535941 | 1917.583695 |
O3_cv.style.apply(
lambda x: [
"background-color: green; color: white" if v else "" for v in x == x.min()
]
).apply(
lambda x: ["background-color: red; color: white" if v else "" for v in x == x.max()]
)
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| index | ||||||||
| ((0, 1, 0, 7), 'O3') | 262.473954 | 59.098881 | 252.466755 | 77.922163 | 0.428828 | 0.816818 | 2066.394508 | 2078.077093 |
| ((0, 1, 1, 7), 'O3') | 246.447939 | 70.922455 | 226.599269 | 40.820186 | 0.239829 | 0.712715 | 1990.231315 | 2004.834546 |
| ((0, 1, 2, 7), 'O3') | 248.205218 | 74.417094 | 231.451621 | 39.934104 | 0.237166 | 0.716739 | 1991.397690 | 2008.921567 |
| ((0, 1, 3, 7), 'O3') | 246.911583 | 71.720709 | 229.323506 | 40.210739 | 0.240089 | 0.663327 | 1992.754260 | 2013.198784 |
| ((0, 1, 4, 7), 'O3') | 246.542373 | 71.182489 | 230.423171 | 40.102290 | 0.239072 | 0.668178 | 1994.594936 | 2017.960106 |
| ((1, 1, 0, 7), 'O3') | 258.837034 | 64.922677 | 250.817675 | 59.579531 | 0.392319 | 0.877001 | 2024.273375 | 2038.876606 |
| ((1, 1, 1, 7), 'O3') | 247.693668 | 73.394681 | 230.869158 | 40.034501 | 0.235595 | 0.716089 | 1991.171569 | 2008.695446 |
| ((1, 1, 2, 7), 'O3') | 247.963681 | 73.536326 | 232.124494 | 39.020059 | 0.241386 | 0.538159 | 1992.399398 | 2012.843921 |
| ((1, 1, 3, 7), 'O3') | 246.619219 | 71.258891 | 231.417522 | 41.620296 | 0.232134 | 0.521243 | 1993.723375 | 2017.088544 |
| ((1, 1, 4, 7), 'O3') | 246.049041 | 72.679535 | 222.657105 | 48.893653 | 0.225439 | 0.692109 | 1995.544855 | 2021.830671 |
| ((2, 1, 0, 7), 'O3') | 257.064991 | 69.899385 | 243.105622 | 53.182756 | 0.304542 | 0.639548 | 2012.284475 | 2029.808352 |
| ((2, 1, 1, 7), 'O3') | 248.149096 | 74.019116 | 231.034459 | 39.677734 | 0.240888 | 0.669583 | 1992.904278 | 2013.348801 |
| ((2, 1, 2, 7), 'O3') | 248.405378 | 74.707722 | 232.135597 | 40.963044 | 0.234942 | 0.516172 | 1994.259735 | 2017.624905 |
| ((2, 1, 3, 7), 'O3') | 246.262801 | 71.503521 | 237.847864 | 34.517209 | 0.220500 | 0.544798 | 1992.572295 | 2018.858111 |
| ((2, 1, 4, 7), 'O3') | 246.250953 | 72.318951 | 229.641347 | 40.743348 | 0.225985 | 0.500451 | 1995.527131 | 2024.733593 |
| ((3, 1, 0, 7), 'O3') | 255.168222 | 71.063623 | 238.900574 | 53.338804 | 0.281317 | 0.547307 | 2005.207326 | 2025.651850 |
| ((3, 1, 1, 7), 'O3') | 249.548091 | 75.072890 | 218.129719 | 54.580898 | 0.242425 | 0.762376 | 1995.939358 | 2019.304527 |
| ((3, 1, 2, 7), 'O3') | 247.420064 | 73.325646 | 232.969138 | 40.358083 | 0.226735 | 0.529388 | 1995.544577 | 2021.830393 |
| ((3, 1, 3, 7), 'O3') | 247.600504 | 73.722550 | 233.761843 | 41.249593 | 0.267794 | 0.530910 | 1996.215201 | 2025.421663 |
| ((3, 1, 4, 7), 'O3') | 245.881184 | 71.215971 | 231.756476 | 40.105820 | 0.216103 | 0.491553 | 1997.213550 | 2029.340659 |
| ((4, 1, 0, 7), 'O3') | 255.172942 | 71.508551 | 238.847559 | 51.879745 | 0.236840 | 0.425928 | 2005.795285 | 2029.160454 |
| ((4, 1, 1, 7), 'O3') | 248.652822 | 72.961042 | 218.039516 | 55.308579 | 0.240964 | 0.790645 | 1997.871392 | 2024.157207 |
| ((4, 1, 2, 7), 'O3') | 247.675924 | 73.821840 | 231.192691 | 40.433339 | 0.223905 | 0.507086 | 1997.144634 | 2026.351096 |
| ((4, 1, 3, 7), 'O3') | 243.155875 | 71.382697 | 227.039828 | 41.031395 | 0.234690 | 0.430303 | 1994.329345 | 2026.456454 |
| ((4, 1, 4, 7), 'O3') | 244.681029 | 71.130034 | 226.715002 | 47.652697 | 0.229659 | 0.301560 | 1997.474603 | 2032.522357 |
Observations
Overall, based on the cross-validation results above, we will fit each gas with its best-performing seasonal order as follows.
CO_model = SARIMAX(endog=y_train["CO"], exog=x_train, seasonal_order=(3, 1, 3, 7)).fit()
HC_model = SARIMAX(endog=y_train["HC"], exog=x_train, seasonal_order=(0, 1, 1, 7)).fit()
NO2_model = SARIMAX(
endog=y_train["NO2"], exog=x_train, seasonal_order=(3, 1, 0, 7)
).fit()
O3_model = SARIMAX(endog=y_train["O3"], exog=x_train, seasonal_order=(4, 1, 1, 7)).fit()
CO_train_pred = CO_model.predict(
start=x_train.index[0], end=x_train.index[-1], exog=x_train
)
HC_train_pred = HC_model.predict(
start=x_train.index[0], end=x_train.index[-1], exog=x_train
)
NO2_train_pred = NO2_model.predict(
start=x_train.index[0], end=x_train.index[-1], exog=x_train
)
O3_train_pred = O3_model.predict(
start=x_train.index[0], end=x_train.index[-1], exog=x_train
)
CO_pred = CO_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
HC_pred = HC_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
NO2_pred = NO2_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
O3_pred = O3_model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
SARIMA_GAS_train_pred = [CO_train_pred, HC_train_pred, NO2_train_pred, O3_train_pred]
SARIMA_GAS_pred = [CO_pred, HC_pred, NO2_pred, O3_pred]
for i in range(len(GAS_List)):
print(
f"{GAS_List[i]} Train MAPE:\t\t{mean_absolute_percentage_error(y_train[GAS_List[i]],SARIMA_GAS_train_pred[i]) * 100:.3f}%"
)
print(
f"{GAS_List[i]} Train RMSE:\t\t{mean_squared_error(y_train[GAS_List[i]],SARIMA_GAS_train_pred[i],squared=False):.3f}"
)
print(
f"{GAS_List[i]} Validation MAPE:\t{mean_absolute_percentage_error(y_test[GAS_List[i]],SARIMA_GAS_pred[i]) * 100:.3f}%"
)
print(
f"{GAS_List[i]} Validation RMSE:\t{mean_squared_error(y_test[GAS_List[i]],SARIMA_GAS_pred[i],squared=False):.3f}"
)
print()
CO Train MAPE:		8.637%
CO Train RMSE:		156.107
CO Validation MAPE:	15.741%
CO Validation RMSE:	108.774

HC Train MAPE:		10.301%
HC Train RMSE:		132.532
HC Validation MAPE:	925.182%
HC Validation RMSE:	135.916

NO2 Train MAPE:		12.811%
NO2 Train RMSE:		158.264
NO2 Validation MAPE:	24.552%
NO2 Validation RMSE:	198.951

O3 Train MAPE:		25.746%
O3 Train RMSE:		181.409
O3 Validation MAPE:	48.880%
O3 Validation RMSE:	251.127
# CO_model = SARIMAX(endog=y["CO"], exog=x, seasonal_order=(3, 1, 3, 7)).fit()
# HC_model = SARIMAX(endog=y["HC"], exog=x, seasonal_order=(0, 1, 1, 7)).fit()
# NO2_model = SARIMAX(endog=y["NO2"], exog=x, seasonal_order=(3, 1, 0, 7)).fit()
# O3_model = SARIMAX(endog=y["O3"], exog=x, seasonal_order=(4, 1, 1, 7)).fit()
# CO_pred = CO_model.predict(start=X_test.index[0], end=X_test.index[-1], exog=X_test)
# HC_pred = HC_model.predict(start=X_test.index[0], end=X_test.index[-1], exog=X_test)
# NO2_pred = NO2_model.predict(start=X_test.index[0], end=X_test.index[-1], exog=X_test)
# O3_pred = O3_model.predict(start=X_test.index[0], end=X_test.index[-1], exog=X_test)
# GAS_pred = [CO_pred, HC_pred, NO2_pred, O3_pred]
# pd.DataFrame(np.array(GAS_pred).flatten()).reset_index().rename({'index':"id",0: "Value"}, axis=1).to_csv(
# "./kaggle/sarima2.csv",index=False
# )
# print("Exporting to csv")
As mentioned previously, VARMAX can model multiple time series jointly. Furthermore, the Granger causality and cointegration tests showed that at least two of the series influence each other. Therefore, we will use a multivariate model such as VARMAX.
model = VARMAX(endog=y_train, exog=x_train, order=(1, 1)).fit(maxiter=1000, disp=False)
model.summary()
| Dep. Variable: | ['CO', 'HC', 'NO2', 'O3'] | No. Observations: | 262 |
|---|---|---|---|
| Model: | VARMAX(1,1) | Log Likelihood | -5721.890 |
| + intercept | AIC | 11551.779 | |
| Date: | Thu, 11 Aug 2022 | BIC | 11744.470 |
| Time: | 19:42:26 | HQIC | 11629.226 |
| Sample: | 03-15-2016 | ||
| - 12-01-2016 | |||
| Covariance Type: | opg |
| Ljung-Box (L1) (Q): | 5.22, 11.87, 10.35, 21.32 | Jarque-Bera (JB): | 1.80, 0.43, 363.43, 66.88 |
|---|---|---|---|
| Prob(Q): | 0.02, 0.00, 0.00, 0.00 | Prob(JB): | 0.41, 0.81, 0.00, 0.00 |
| Heteroskedasticity (H): | 1.44, 0.96, 1.45, 2.04 | Skew: | -0.05, 0.09, 0.85, 0.73 |
| Prob(H) (two-sided): | 0.09, 0.83, 0.09, 0.00 | Kurtosis: | 3.39, 3.09, 8.51, 4.99 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 115.1368 | 528.098 | 0.218 | 0.827 | -919.915 | 1150.189 |
| L1.CO | -5.4921 | 5.617 | -0.978 | 0.328 | -16.501 | 5.517 |
| L1.HC | -0.5681 | 0.629 | -0.903 | 0.367 | -1.802 | 0.665 |
| L1.NO2 | 3.0909 | 2.949 | 1.048 | 0.295 | -2.690 | 8.872 |
| L1.O3 | 4.6699 | 4.042 | 1.155 | 0.248 | -3.252 | 12.591 |
| L1.e(CO) | 5.7633 | 5.610 | 1.027 | 0.304 | -5.231 | 16.758 |
| L1.e(HC) | 0.3010 | 0.601 | 0.501 | 0.616 | -0.876 | 1.478 |
| L1.e(NO2) | -2.8151 | 2.973 | -0.947 | 0.344 | -8.642 | 3.011 |
| L1.e(O3) | -4.1302 | 4.056 | -1.018 | 0.309 | -12.079 | 3.819 |
| beta.T | 1.6007 | 0.583 | 2.747 | 0.006 | 0.458 | 2.743 |
| beta.RH | 2.8840 | 0.475 | 6.075 | 0.000 | 1.954 | 3.815 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 303.4175 | 343.502 | 0.883 | 0.377 | -369.834 | 976.669 |
| L1.CO | -3.2589 | 3.690 | -0.883 | 0.377 | -10.491 | 3.973 |
| L1.HC | 0.0374 | 0.426 | 0.088 | 0.930 | -0.797 | 0.872 |
| L1.NO2 | 1.6220 | 1.930 | 0.841 | 0.401 | -2.160 | 5.404 |
| L1.O3 | 2.5675 | 2.629 | 0.977 | 0.329 | -2.585 | 7.720 |
| L1.e(CO) | 2.8220 | 3.702 | 0.762 | 0.446 | -4.433 | 10.077 |
| L1.e(HC) | 0.1063 | 0.421 | 0.253 | 0.801 | -0.719 | 0.931 |
| L1.e(NO2) | -1.5010 | 1.972 | -0.761 | 0.447 | -5.367 | 2.365 |
| L1.e(O3) | -1.9887 | 2.666 | -0.746 | 0.456 | -7.215 | 3.237 |
| beta.T | 2.0944 | 0.696 | 3.010 | 0.003 | 0.731 | 3.458 |
| beta.RH | 1.9269 | 0.572 | 3.370 | 0.001 | 0.806 | 3.048 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 48.1959 | 602.731 | 0.080 | 0.936 | -1133.135 | 1229.526 |
| L1.CO | -5.3333 | 6.317 | -0.844 | 0.399 | -17.714 | 7.047 |
| L1.HC | -0.7126 | 0.750 | -0.950 | 0.342 | -2.183 | 0.758 |
| L1.NO2 | 3.3665 | 3.243 | 1.038 | 0.299 | -2.990 | 9.723 |
| L1.O3 | 4.3382 | 4.404 | 0.985 | 0.325 | -4.294 | 12.970 |
| L1.e(CO) | 6.6596 | 6.356 | 1.048 | 0.295 | -5.798 | 19.117 |
| L1.e(HC) | 0.7619 | 0.831 | 0.917 | 0.359 | -0.867 | 2.391 |
| L1.e(NO2) | -3.0224 | 3.268 | -0.925 | 0.355 | -9.427 | 3.382 |
| L1.e(O3) | -5.1563 | 4.423 | -1.166 | 0.244 | -13.824 | 3.512 |
| beta.T | 2.2185 | 0.769 | 2.886 | 0.004 | 0.712 | 3.725 |
| beta.RH | 0.8024 | 0.656 | 1.224 | 0.221 | -0.483 | 2.087 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 217.3890 | 502.515 | 0.433 | 0.665 | -767.521 | 1202.299 |
| L1.CO | -3.6744 | 5.378 | -0.683 | 0.494 | -14.216 | 6.867 |
| L1.HC | -0.3599 | 0.670 | -0.537 | 0.591 | -1.672 | 0.953 |
| L1.NO2 | 1.8498 | 2.793 | 0.662 | 0.508 | -3.624 | 7.324 |
| L1.O3 | 3.3313 | 3.785 | 0.880 | 0.379 | -4.087 | 10.750 |
| L1.e(CO) | 3.1047 | 5.432 | 0.572 | 0.568 | -7.543 | 13.752 |
| L1.e(HC) | -0.0185 | 0.707 | -0.026 | 0.979 | -1.404 | 1.367 |
| L1.e(NO2) | -1.6876 | 2.886 | -0.585 | 0.559 | -7.343 | 3.968 |
| L1.e(O3) | -2.0109 | 3.860 | -0.521 | 0.602 | -9.577 | 5.555 |
| beta.T | 0.7086 | 1.207 | 0.587 | 0.557 | -1.657 | 3.075 |
| beta.RH | 3.0684 | 0.968 | 3.170 | 0.002 | 1.171 | 4.966 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| sqrt.var.CO | 68.2907 | 4.427 | 15.424 | 0.000 | 59.613 | 76.968 |
| sqrt.cov.CO.HC | 73.8049 | 6.529 | 11.304 | 0.000 | 61.008 | 86.602 |
| sqrt.var.HC | 40.6955 | 2.409 | 16.892 | 0.000 | 35.974 | 45.418 |
| sqrt.cov.CO.NO2 | -75.8759 | 7.149 | -10.613 | 0.000 | -89.888 | -61.864 |
| sqrt.cov.HC.NO2 | -16.7579 | 5.636 | -2.974 | 0.003 | -27.804 | -5.712 |
| sqrt.var.NO2 | 60.5867 | 3.234 | 18.732 | 0.000 | 54.248 | 66.926 |
| sqrt.cov.CO.O3 | 127.5987 | 10.243 | 12.457 | 0.000 | 107.522 | 147.675 |
| sqrt.cov.HC.O3 | 39.2484 | 4.155 | 9.447 | 0.000 | 31.105 | 47.392 |
| sqrt.cov.NO2.O3 | -2.8691 | 4.534 | -0.633 | 0.527 | -11.756 | 6.018 |
| sqrt.var.O3 | 44.4336 | 1.815 | 24.487 | 0.000 | 40.877 | 47.990 |
model.plot_diagnostics()
plt.show()
Observations
The residual plot shows little remaining structure, and the histogram suggests that the residuals are approximately normally distributed. However, in the correlogram, lags 1 and 2 fall outside the 5% significance bounds, which means the model may require further tuning.
train_pred = model.predict(start=x_train.index[0], end=x_train.index[-1], exog=x_train)
valid_pred = model.predict(start=x_test.index[0], end=x_test.index[-1], exog=x_test)
print(
"Train MAPE:\t\t{:.3f}%".format(
mean_absolute_percentage_error(y_train, train_pred) * 100
)
)
print(
"Train RMSE:\t\t{:.3f}".format(
mean_squared_error(y_train, train_pred, squared=False)
)
)
print(
"Validation MAPE:\t{:.3f}%".format(
mean_absolute_percentage_error(y_test, valid_pred) * 100
)
)
print(
"Validation RMSE:\t{:.3f}".format(
mean_squared_error(y_test, valid_pred, squared=False)
)
)
Train MAPE:		12.823%
Train RMSE:		98.433
Validation MAPE:	126.684%
Validation RMSE:	162.732
fig, ax = plt.subplots()
valid_pred.plot(ax=ax, alpha=0.9)
y_test.plot(alpha=0.5, ax=ax, linestyle="--")
ax.set_title("Comparison between Predictions and Ground Truth")
plt.show()
Observations
The predicted values do not reproduce the spikes and dips present in the actual series, which suggests that the selected order was not optimal. More tuning is needed to improve the model.
avg, _ = model_cv(VARMAX, exog=x, endog=y, splits=5, order=(1, 1))
avg
train_rmse        115.899651
train_rmse_std     22.994974
valid_rmse        176.237147
valid_rmse_std     34.414391
train_mape          0.237945
valid_mape          0.864523
AIC              7400.438812
BIC              7561.333814
Name: (1, 1), dtype: float64
Observations
The order (1, 1) gives reasonable train and validation RMSE, and its train MAPE is close to 0. We also note that VARMAX models incur heavy AIC and BIC penalties because they estimate far more parameters, and cost more computation, than SARIMA or ARIMA.
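The `model_cv` helper used above (defined earlier in the notebook) is built on `TimeSeriesSplit`, which produces expanding training windows so that each validation fold always comes strictly after its training data. A minimal sketch of how the folds are generated:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(20)  # stand-in for the daily observations
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, valid_idx) in enumerate(tscv.split(data)):
    # each fold trains on an expanding prefix and validates on the next block
    print(f"fold {fold}: train=[0..{train_idx[-1]}], "
          f"valid=[{valid_idx[0]}..{valid_idx[-1]}]")
```

This ordering is what prevents leakage of future values into training, which an ordinary shuffled K-fold split would allow.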
scores_df = pd.DataFrame()
p = np.arange(0, 2)
q = np.arange(0, 9)
for i in list(product(p, q)):
print("Computing (p, q):", i)
try:
scores_df[i], _ = model_cv(
VARMAX, order=i, splits=5, exog=x, endog=y, max_iter=100
)
except Exception as e:
print("Error computing (p, q):", i)
print(e)
pass
Computing (p, q): (0, 0)
Error computing (p, q): (0, 0)
Invalid VARMAX(p,q) specification; at least one p,q must be greater than zero.
Computing (p, q): (0, 1)
Computing (p, q): (0, 2)
Computing (p, q): (0, 3)
Computing (p, q): (0, 4)
Computing (p, q): (0, 5)
Computing (p, q): (0, 6)
Computing (p, q): (0, 7)
Computing (p, q): (0, 8)
Computing (p, q): (1, 0)
Computing (p, q): (1, 1)
Computing (p, q): (1, 2)
Computing (p, q): (1, 3)
Computing (p, q): (1, 4)
Computing (p, q): (1, 5)
Computing (p, q): (1, 6)
Computing (p, q): (1, 7)
Computing (p, q): (1, 8)
scores_df.T.style.apply(
lambda x: [
"background-color: green; color: white" if v else "" for v in x == x.min()
]
).apply(
lambda x: ["background-color: red; color: white" if v else "" for v in x == x.max()]
)
| train_rmse | train_rmse_std | valid_rmse | valid_rmse_std | train_mape | valid_mape | AIC | BIC | |
|---|---|---|---|---|---|---|---|---|
| (0, 1) | 93.002998 | 5.984985 | 149.322418 | 20.420019 | 0.119277 | 0.341591 | 7265.277936 | 7378.500344 |
| (0, 2) | 89.894723 | 5.060442 | 148.138933 | 21.556525 | 0.108975 | 0.268482 | 7210.076954 | 7370.971955 |
| (0, 3) | 87.454531 | 5.161977 | 147.877021 | 23.014975 | 0.124809 | 0.366505 | 7198.910173 | 7407.477768 |
| (0, 4) | 85.775759 | 6.370535 | 147.983182 | 23.274048 | 0.115429 | 0.407572 | 7203.596825 | 7459.837013 |
| (0, 5) | 86.101466 | 5.380451 | 149.164965 | 23.486088 | 0.122183 | 0.524566 | 7211.172464 | 7515.085244 |
| (0, 6) | 84.415567 | 5.695552 | 148.354794 | 21.899687 | 0.109404 | 0.410920 | 7224.695291 | 7576.280665 |
| (0, 7) | 81.483808 | 5.960166 | 147.263737 | 22.531687 | 0.110160 | 0.419208 | 7219.963276 | 7619.221243 |
| (0, 8) | 80.425377 | 6.208739 | 149.015601 | 23.068919 | 0.093411 | 0.376010 | 7246.019678 | 7692.950237 |
| (1, 0) | 158.196426 | 11.585930 | 412.602575 | 191.908196 | 0.347873 | 7.344966 | 8309.787134 | 8423.009543 |
| (1, 1) | 121.019752 | 26.809005 | 191.197882 | 23.561448 | 0.270154 | 1.271650 | 7544.308337 | 7705.203338 |
| (1, 2) | 114.368379 | 18.039646 | 195.553870 | 23.238892 | 0.286826 | 1.729729 | 7382.320308 | 7590.887902 |
| (1, 3) | 116.987539 | 24.907146 | 213.782802 | 75.596954 | 0.318983 | 2.305007 | 7413.075070 | 7669.315257 |
| (1, 4) | 111.762610 | 14.711369 | 204.025143 | 48.382068 | 0.328935 | 2.840975 | 7458.830950 | 7762.743730 |
| (1, 5) | 138.268160 | 40.031101 | 381.715221 | 200.632227 | 0.336641 | 2.836224 | 7574.868847 | 7926.454220 |
| (1, 6) | 121.886049 | 27.000724 | 213.582887 | 39.151251 | 0.324774 | 2.986266 | 7559.088203 | 7958.346169 |
| (1, 7) | 129.453995 | 31.606046 | 402.582310 | 176.558449 | 0.326138 | 2.749244 | 7669.510859 | 8116.441419 |
| (1, 8) | 126.038263 | 42.463538 | 454.537465 | 432.270181 | 0.334832 | 2.353588 | 7651.635889 | 8146.239042 |
Observations
The order (0, 7) performed best, with the lowest validation RMSE. We also note that increasing the p (autoregressive) value makes the score worse: the validation RMSE rises drastically once AR terms are added. Order (0, 8) is also a strong candidate, with the lowest train RMSE and train MAPE.
Based on validation RMSE, VARMAX generally performs better than SARIMA. This is likely because VARMAX can exploit the correlations between the different series, so we will use VARMAX as the main model.
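The winning order can also be read off the transposed score table programmatically with `idxmin`. A sketch using a toy frame in place of `scores_df.T` (the values below are illustrative, taken from the table above):

```python
import pandas as pd

# toy stand-in for scores_df.T: rows are (p, q) orders, columns are metrics
scores = pd.DataFrame(
    {"valid_rmse": [149.3, 147.3, 191.2], "train_rmse": [93.0, 81.5, 121.0]},
    index=[(0, 1), (0, 7), (1, 1)],
)

best_order = scores["valid_rmse"].idxmin()  # order with lowest validation RMSE
print(best_order)  # → (0, 7)
```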
After we are satisfied with the model performance on the train and validation set, it is about time for us to generate the final prediction for the test set and submit it on the Kaggle competition page.
df_test = pd.read_csv("./test.csv").drop("id", axis=1)
df_test["Date"] = pd.to_datetime(df_test["Date"], format="%d/%m/%Y")
df_test
| Date | T | RH | Gas | Unnamed: 5 | Unnamed: 6 | |
|---|---|---|---|---|---|---|
| 0 | 2017-02-06 | 6.616667 | 51.734375 | CO | NaN | NaN |
| 1 | 2017-02-07 | 7.613194 | 43.930903 | CO | NaN | NaN |
| 2 | 2017-02-08 | 7.252083 | 50.966667 | CO | NaN | NaN |
| 3 | 2017-02-09 | 7.473611 | 50.166319 | CO | NaN | NaN |
| 4 | 2017-02-10 | 5.571875 | 46.604167 | CO | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 247 | 2017-04-05 | 17.554167 | 50.092708 | O3 | NaN | NaN |
| 248 | 2017-04-06 | 15.919792 | 35.959722 | O3 | NaN | NaN |
| 249 | 2017-04-07 | 15.489583 | 32.213542 | O3 | NaN | NaN |
| 250 | 2017-04-08 | 18.381250 | 33.686458 | O3 | NaN | NaN |
| 251 | 2017-04-09 | 16.966667 | 42.791667 | O3 | NaN | NaN |
252 rows × 6 columns
# prepare exogenous testing set
X_test = df_test.set_index("Date")[["T", "RH"]].resample("d").mean().to_period("D")
X_test
| T | RH | |
|---|---|---|
| Date | ||
| 2017-02-06 | 6.616667 | 51.734375 |
| 2017-02-07 | 7.613194 | 43.930903 |
| 2017-02-08 | 7.252083 | 50.966667 |
| 2017-02-09 | 7.473611 | 50.166319 |
| 2017-02-10 | 5.571875 | 46.604167 |
| ... | ... | ... |
| 2017-04-05 | 17.554167 | 50.092708 |
| 2017-04-06 | 15.919792 | 35.959722 |
| 2017-04-07 | 15.489583 | 32.213542 |
| 2017-04-08 | 18.381250 | 33.686458 |
| 2017-04-09 | 16.966667 | 42.791667 |
63 rows × 2 columns
model = VARMAX(endog=y, exog=x, order=(0, 7), trend="c").fit(maxiter=1000, disp=False)
# making a directory if it does not exist
Path("./models").mkdir(parents=True, exist_ok=True)
# saving model to pickle
with open("./models/varmax07.pkl", "wb") as f:
pickle.dump(model, f)
print("Model saved!")
model.summary()
Model saved!
| Dep. Variable: | ['CO', 'HC', 'NO2', 'O3'] | No. Observations: | 328 |
|---|---|---|---|
| Model: | VMAX(7) | Log Likelihood | -7098.015 |
| + intercept | AIC | 14464.031 | |
| Date: | Thu, 11 Aug 2022 | BIC | 14972.295 |
| Time: | 22:37:35 | HQIC | 14666.814 |
| Sample: | 03-15-2016 | ||
| - 02-05-2017 | |||
| Covariance Type: | opg |
| Ljung-Box (L1) (Q): | 4.93, 19.73, 4.18, 7.84 | Jarque-Bera (JB): | 11.03, 0.17, 1057.23, 32.33 |
|---|---|---|---|
| Prob(Q): | 0.03, 0.00, 0.04, 0.01 | Prob(JB): | 0.00, 0.92, 0.00, 0.00 |
| Heteroskedasticity (H): | 1.79, 0.87, 1.88, 2.39 | Skew: | -0.21, -0.06, 1.54, 0.41 |
| Prob(H) (two-sided): | 0.00, 0.46, 0.00, 0.00 | Kurtosis: | 3.80, 2.99, 11.24, 4.30 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 685.8639 | 21.761 | 31.519 | 0.000 | 643.214 | 728.514 |
| L1.e(CO) | 0.4907 | 0.250 | 1.965 | 0.049 | 0.001 | 0.980 |
| L1.e(HC) | -0.1527 | 0.213 | -0.718 | 0.473 | -0.569 | 0.264 |
| L1.e(NO2) | 0.1581 | 0.120 | 1.317 | 0.188 | -0.077 | 0.393 |
| L1.e(O3) | 0.2525 | 0.130 | 1.938 | 0.053 | -0.003 | 0.508 |
| L2.e(CO) | 0.2124 | 0.263 | 0.807 | 0.420 | -0.304 | 0.729 |
| L2.e(HC) | -0.0736 | 0.218 | -0.338 | 0.735 | -0.500 | 0.353 |
| L2.e(NO2) | 0.1755 | 0.128 | 1.366 | 0.172 | -0.076 | 0.427 |
| L2.e(O3) | 0.1370 | 0.145 | 0.942 | 0.346 | -0.148 | 0.422 |
| L3.e(CO) | 0.3549 | 0.299 | 1.186 | 0.236 | -0.232 | 0.941 |
| L3.e(HC) | -0.0589 | 0.245 | -0.240 | 0.810 | -0.540 | 0.422 |
| L3.e(NO2) | 0.2617 | 0.136 | 1.924 | 0.054 | -0.005 | 0.528 |
| L3.e(O3) | 0.0195 | 0.151 | 0.129 | 0.897 | -0.276 | 0.315 |
| L4.e(CO) | 0.2412 | 0.301 | 0.800 | 0.424 | -0.349 | 0.832 |
| L4.e(HC) | -0.0145 | 0.264 | -0.055 | 0.956 | -0.532 | 0.503 |
| L4.e(NO2) | 0.1825 | 0.121 | 1.508 | 0.132 | -0.055 | 0.420 |
| L4.e(O3) | -0.0463 | 0.160 | -0.289 | 0.772 | -0.360 | 0.267 |
| L5.e(CO) | 0.1156 | 0.329 | 0.351 | 0.725 | -0.529 | 0.761 |
| L5.e(HC) | -0.1880 | 0.265 | -0.710 | 0.478 | -0.707 | 0.331 |
| L5.e(NO2) | 0.1219 | 0.154 | 0.794 | 0.427 | -0.179 | 0.423 |
| L5.e(O3) | 0.0512 | 0.154 | 0.332 | 0.740 | -0.251 | 0.353 |
| L6.e(CO) | 0.0791 | 0.290 | 0.273 | 0.785 | -0.489 | 0.647 |
| L6.e(HC) | -0.1545 | 0.261 | -0.592 | 0.554 | -0.666 | 0.357 |
| L6.e(NO2) | 0.0733 | 0.121 | 0.604 | 0.546 | -0.165 | 0.311 |
| L6.e(O3) | 0.0434 | 0.155 | 0.280 | 0.780 | -0.261 | 0.347 |
| L7.e(CO) | 0.0751 | 0.226 | 0.332 | 0.740 | -0.368 | 0.519 |
| L7.e(HC) | 0.2463 | 0.214 | 1.150 | 0.250 | -0.173 | 0.666 |
| L7.e(NO2) | 0.0678 | 0.082 | 0.829 | 0.407 | -0.093 | 0.228 |
| L7.e(O3) | -0.0809 | 0.125 | -0.649 | 0.516 | -0.325 | 0.163 |
| beta.T | 0.6651 | 0.664 | 1.002 | 0.316 | -0.635 | 1.966 |
| beta.RH | 3.6170 | 0.552 | 6.550 | 0.000 | 2.535 | 4.699 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 629.5415 | 23.681 | 26.584 | 0.000 | 583.127 | 675.956 |
| L1.e(CO) | -0.2401 | 0.292 | -0.822 | 0.411 | -0.812 | 0.332 |
| L1.e(HC) | 0.3034 | 0.252 | 1.204 | 0.229 | -0.191 | 0.797 |
| L1.e(NO2) | 0.0687 | 0.140 | 0.490 | 0.624 | -0.206 | 0.344 |
| L1.e(O3) | 0.2730 | 0.160 | 1.709 | 0.087 | -0.040 | 0.586 |
| L2.e(CO) | -0.0876 | 0.306 | -0.286 | 0.775 | -0.688 | 0.513 |
| L2.e(HC) | 0.1475 | 0.262 | 0.563 | 0.573 | -0.366 | 0.661 |
| L2.e(NO2) | 0.0753 | 0.151 | 0.499 | 0.618 | -0.220 | 0.371 |
| L2.e(O3) | 0.0644 | 0.162 | 0.399 | 0.690 | -0.252 | 0.381 |
| L3.e(CO) | 0.2388 | 0.333 | 0.717 | 0.473 | -0.414 | 0.891 |
| L3.e(HC) | 0.0711 | 0.291 | 0.245 | 0.807 | -0.499 | 0.641 |
| L3.e(NO2) | 0.1286 | 0.152 | 0.848 | 0.397 | -0.169 | 0.426 |
| L3.e(O3) | -0.0803 | 0.177 | -0.455 | 0.649 | -0.426 | 0.266 |
| L4.e(CO) | 0.2270 | 0.329 | 0.690 | 0.490 | -0.418 | 0.872 |
| L4.e(HC) | 0.0768 | 0.277 | 0.278 | 0.781 | -0.465 | 0.619 |
| L4.e(NO2) | 0.0795 | 0.136 | 0.585 | 0.559 | -0.187 | 0.346 |
| L4.e(O3) | -0.1597 | 0.174 | -0.917 | 0.359 | -0.501 | 0.182 |
| L5.e(CO) | 0.2639 | 0.339 | 0.778 | 0.436 | -0.401 | 0.929 |
| L5.e(HC) | -0.2728 | 0.268 | -1.019 | 0.308 | -0.797 | 0.252 |
| L5.e(NO2) | 0.0582 | 0.171 | 0.340 | 0.734 | -0.278 | 0.394 |
| L5.e(O3) | -0.0041 | 0.170 | -0.024 | 0.981 | -0.337 | 0.328 |
| L6.e(CO) | -0.0502 | 0.314 | -0.160 | 0.873 | -0.666 | 0.566 |
| L6.e(HC) | -0.1255 | 0.287 | -0.437 | 0.662 | -0.689 | 0.438 |
| L6.e(NO2) | 0.0449 | 0.134 | 0.335 | 0.738 | -0.218 | 0.308 |
| L6.e(O3) | 0.1059 | 0.160 | 0.662 | 0.508 | -0.208 | 0.419 |
| L7.e(CO) | -0.2439 | 0.279 | -0.875 | 0.381 | -0.790 | 0.302 |
| L7.e(HC) | 0.6594 | 0.251 | 2.626 | 0.009 | 0.167 | 1.151 |
| L7.e(NO2) | -0.0250 | 0.099 | -0.253 | 0.800 | -0.219 | 0.169 |
| L7.e(O3) | -0.1332 | 0.143 | -0.933 | 0.351 | -0.413 | 0.147 |
| beta.T | 2.1204 | 0.798 | 2.658 | 0.008 | 0.557 | 3.684 |
| beta.RH | 1.8283 | 0.667 | 2.739 | 0.006 | 0.520 | 3.136 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 602.8459 | 34.839 | 17.304 | 0.000 | 534.564 | 671.128 |
| L1.e(CO) | -0.0759 | 0.361 | -0.210 | 0.833 | -0.783 | 0.631 |
| L1.e(HC) | 0.1022 | 0.311 | 0.329 | 0.743 | -0.507 | 0.712 |
| L1.e(NO2) | 0.5609 | 0.160 | 3.504 | 0.000 | 0.247 | 0.875 |
| L1.e(O3) | -0.1250 | 0.201 | -0.621 | 0.534 | -0.519 | 0.269 |
| L2.e(CO) | -0.0290 | 0.465 | -0.062 | 0.950 | -0.941 | 0.882 |
| L2.e(HC) | -0.1752 | 0.387 | -0.453 | 0.651 | -0.934 | 0.584 |
| L2.e(NO2) | 0.2260 | 0.189 | 1.193 | 0.233 | -0.145 | 0.597 |
| L2.e(O3) | 0.0531 | 0.258 | 0.206 | 0.837 | -0.452 | 0.558 |
| L3.e(CO) | -0.4527 | 0.444 | -1.019 | 0.308 | -1.323 | 0.418 |
| L3.e(HC) | -0.0269 | 0.386 | -0.070 | 0.944 | -0.784 | 0.730 |
| L3.e(NO2) | -0.0292 | 0.179 | -0.163 | 0.871 | -0.380 | 0.322 |
| L3.e(O3) | 0.1461 | 0.258 | 0.567 | 0.571 | -0.359 | 0.651 |
| L4.e(CO) | -0.0529 | 0.396 | -0.134 | 0.894 | -0.829 | 0.723 |
| L4.e(HC) | -0.1525 | 0.412 | -0.370 | 0.711 | -0.960 | 0.655 |
| L4.e(NO2) | -0.0622 | 0.166 | -0.374 | 0.708 | -0.388 | 0.264 |
| L4.e(O3) | 0.0851 | 0.228 | 0.373 | 0.709 | -0.362 | 0.532 |
| L5.e(CO) | 0.4828 | 0.426 | 1.132 | 0.257 | -0.353 | 1.319 |
| L5.e(HC) | 0.1963 | 0.379 | 0.518 | 0.605 | -0.547 | 0.940 |
| L5.e(NO2) | 0.0869 | 0.195 | 0.445 | 0.656 | -0.296 | 0.470 |
| L5.e(O3) | -0.2144 | 0.220 | -0.976 | 0.329 | -0.645 | 0.216 |
| L6.e(CO) | 0.2111 | 0.432 | 0.489 | 0.625 | -0.636 | 1.058 |
| L6.e(HC) | 0.3580 | 0.361 | 0.991 | 0.321 | -0.350 | 1.066 |
| L6.e(NO2) | 0.1772 | 0.187 | 0.950 | 0.342 | -0.188 | 0.543 |
| L6.e(O3) | -0.1587 | 0.243 | -0.652 | 0.514 | -0.636 | 0.318 |
| L7.e(CO) | -0.1639 | 0.386 | -0.425 | 0.671 | -0.920 | 0.592 |
| L7.e(HC) | -0.3229 | 0.364 | -0.888 | 0.375 | -1.036 | 0.390 |
| L7.e(NO2) | 0.0584 | 0.117 | 0.501 | 0.616 | -0.170 | 0.287 |
| L7.e(O3) | 0.2199 | 0.203 | 1.085 | 0.278 | -0.177 | 0.617 |
| beta.T | 3.7148 | 0.990 | 3.753 | 0.000 | 1.775 | 5.655 |
| beta.RH | 0.0155 | 0.836 | 0.019 | 0.985 | -1.622 | 1.653 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 601.6083 | 39.005 | 15.424 | 0.000 | 525.160 | 678.056 |
| L1.e(CO) | -0.7652 | 0.510 | -1.500 | 0.134 | -1.765 | 0.235 |
| L1.e(HC) | -0.2130 | 0.470 | -0.453 | 0.651 | -1.135 | 0.709 |
| L1.e(NO2) | 0.0920 | 0.254 | 0.362 | 0.717 | -0.406 | 0.590 |
| L1.e(O3) | 1.0920 | 0.285 | 3.832 | 0.000 | 0.533 | 1.651 |
| L2.e(CO) | -0.2924 | 0.552 | -0.529 | 0.597 | -1.375 | 0.790 |
| L2.e(HC) | -0.0724 | 0.508 | -0.143 | 0.887 | -1.068 | 0.923 |
| L2.e(NO2) | 0.0352 | 0.273 | 0.129 | 0.897 | -0.500 | 0.570 |
| L2.e(O3) | 0.3085 | 0.290 | 1.064 | 0.287 | -0.260 | 0.877 |
| L3.e(CO) | 0.5408 | 0.612 | 0.883 | 0.377 | -0.659 | 1.741 |
| L3.e(HC) | -0.0505 | 0.526 | -0.096 | 0.923 | -1.081 | 0.980 |
| L3.e(NO2) | 0.2484 | 0.279 | 0.891 | 0.373 | -0.298 | 0.795 |
| L3.e(O3) | -0.1243 | 0.318 | -0.391 | 0.696 | -0.748 | 0.499 |
| L4.e(CO) | 0.4881 | 0.594 | 0.822 | 0.411 | -0.676 | 1.653 |
| L4.e(HC) | 0.1713 | 0.504 | 0.340 | 0.734 | -0.817 | 1.160 |
| L4.e(NO2) | 0.1908 | 0.260 | 0.733 | 0.464 | -0.320 | 0.701 |
| L4.e(O3) | -0.2991 | 0.314 | -0.954 | 0.340 | -0.914 | 0.316 |
| L5.e(CO) | 0.3745 | 0.650 | 0.576 | 0.565 | -0.900 | 1.649 |
| L5.e(HC) | -0.4065 | 0.536 | -0.759 | 0.448 | -1.457 | 0.644 |
| L5.e(NO2) | 0.0286 | 0.326 | 0.088 | 0.930 | -0.611 | 0.668 |
| L5.e(O3) | -0.0536 | 0.322 | -0.167 | 0.868 | -0.684 | 0.577 |
| L6.e(CO) | -0.1473 | 0.580 | -0.254 | 0.800 | -1.285 | 0.990 |
| L6.e(HC) | -0.3591 | 0.561 | -0.640 | 0.522 | -1.459 | 0.741 |
| L6.e(NO2) | 0.0091 | 0.255 | 0.036 | 0.971 | -0.491 | 0.509 |
| L6.e(O3) | 0.2376 | 0.319 | 0.746 | 0.456 | -0.387 | 0.862 |
| L7.e(CO) | -0.3120 | 0.509 | -0.613 | 0.540 | -1.309 | 0.685 |
| L7.e(HC) | 0.7920 | 0.477 | 1.660 | 0.097 | -0.143 | 1.727 |
| L7.e(NO2) | -0.0090 | 0.183 | -0.049 | 0.961 | -0.367 | 0.349 |
| L7.e(O3) | -0.1347 | 0.270 | -0.500 | 0.617 | -0.663 | 0.394 |
| beta.T | -0.8975 | 1.253 | -0.716 | 0.474 | -3.354 | 1.559 |
| beta.RH | 4.7617 | 1.046 | 4.552 | 0.000 | 2.711 | 6.812 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| sqrt.var.CO | 71.4015 | 5.318 | 13.426 | 0.000 | 60.978 | 81.825 |
| sqrt.cov.CO.HC | 79.5154 | 7.654 | 10.389 | 0.000 | 64.514 | 94.517 |
| sqrt.var.HC | 39.6509 | 2.895 | 13.696 | 0.000 | 33.977 | 45.325 |
| sqrt.cov.CO.NO2 | -80.2762 | 9.532 | -8.421 | 0.000 | -98.959 | -61.593 |
| sqrt.cov.HC.NO2 | -14.4625 | 7.517 | -1.924 | 0.054 | -29.196 | 0.271 |
| sqrt.var.NO2 | 67.8179 | 3.991 | 16.994 | 0.000 | 59.996 | 75.640 |
| sqrt.cov.CO.O3 | 143.9678 | 13.106 | 10.985 | 0.000 | 118.280 | 169.656 |
| sqrt.cov.HC.O3 | 33.8801 | 5.028 | 6.738 | 0.000 | 24.025 | 43.735 |
| sqrt.cov.NO2.O3 | -1.0858 | 6.225 | -0.174 | 0.862 | -13.287 | 11.115 |
| sqrt.var.O3 | 48.0760 | 2.981 | 16.129 | 0.000 | 42.234 | 53.918 |
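Since the fitted results were pickled above, a later session can reload them without refitting. A minimal round-trip sketch, using a plain dict in place of the VARMAX results object (the filename `demo.pkl` is illustrative):

```python
import pickle
from pathlib import Path

Path("./models").mkdir(parents=True, exist_ok=True)

# any picklable object round-trips the same way as the fitted VARMAX results
obj = {"order": (0, 7), "trend": "c"}
with open("./models/demo.pkl", "wb") as f:
    pickle.dump(obj, f)

with open("./models/demo.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored["order"])  # → (0, 7)
```

For the real model, `pickle.load` on `./models/varmax07.pkl` would return the results object, ready for `.predict()`.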
model = VARMAX(endog=y, exog=x, order=(0, 8), trend="c").fit(maxiter=1000, disp=False)
# making a directory if it does not exist
Path("./models").mkdir(parents=True, exist_ok=True)
# saving model to pickle
with open("./models/varmax08.pkl", "wb") as f:
pickle.dump(model, f)
print("Model saved!")
model.summary()
model.plot_diagnostics()
plt.show()
| variable | Value | |
|---|---|---|
| id | ||
| 0 | CO | 997.628033 |
| 1 | CO | 942.750789 |
| 2 | CO | 920.430462 |
| 3 | CO | 886.595934 |
| 4 | CO | 841.008239 |
| ... | ... | ... |
| 247 | O3 | 824.380572 |
| 248 | O3 | 758.550143 |
| 249 | O3 | 741.098003 |
| 250 | O3 | 745.516315 |
| 251 | O3 | 790.142356 |
252 rows × 2 columns
# generate predictions
predictions = model.predict(start=X_test.index[0], end=X_test.index[-1], exog=X_test)
predictions.plot()
plt.title("Predicted Gas Values")
predictions = pd.melt(
predictions, value_vars=["CO", "HC", "NO2", "O3"], value_name="Value"
)
predictions.index.rename("id", inplace=True)
predictions
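The `pd.melt` call in the cell above reshapes the wide prediction frame (one column per gas) into the long `id / variable / Value` layout required for submission. A toy stand-in frame illustrates the reshape:

```python
import pandas as pd

# Tiny wide frame standing in for the VARMAX predictions:
# one column per gas.
wide = pd.DataFrame({"CO": [1.0, 2.0], "HC": [3.0, 4.0]})

# melt stacks the gas columns into (variable, Value) pairs,
# matching the long "id / variable / Value" layout shown above.
long_df = pd.melt(wide, value_vars=["CO", "HC"], value_name="Value")
long_df.index.rename("id", inplace=True)
print(long_df)
```

Each gas column is stacked in turn, which is why the 63-day forecasts for the four gases produce 252 rows in the output above.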
Exported predictions to .csv
RMSE: 152.09558
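The RMSE reported above is not computed in a visible cell; a minimal sketch of the calculation, using illustrative values rather than the notebook's actual test targets:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only; in the notebook, `actual` would be the
# held-out test targets and `predicted` the melted VARMAX predictions.
actual = np.array([1000.0, 950.0, 900.0])
predicted = np.array([997.6, 942.8, 920.4])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(f"RMSE: {rmse:.5f}")
```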
We will also use (0, 8) as the second candidate order, since it has the second-best parameters, and compare the Kaggle scores of the two models.
# making a directory if it does not exist
Path("./kaggle").mkdir(parents=True, exist_ok=True)
predictions["Value"].to_csv("./kaggle/varmax08.csv")
print("Exported predictions to .csv")
Model saved!
| Dep. Variable: | ['CO', 'HC', 'NO2', 'O3'] | No. Observations: | 328 |
|---|---|---|---|
| Model: | VMAX(8) + intercept | Log Likelihood | -7091.633 |
| Date: | Thu, 11 Aug 2022 | AIC | 14483.266 |
| Time: | 22:41:27 | BIC | 15052.218 |
| Sample: | 03-15-2016 to 02-05-2017 | HQIC | 14710.261 |
| Covariance Type: | opg | | |
| Ljung-Box (L1) (Q): | 4.51, 11.74, 7.89, 12.78 | Jarque-Bera (JB): | 3.18, 0.01, 677.46, 46.73 |
|---|---|---|---|
| Prob(Q): | 0.03, 0.00, 0.00, 0.00 | Prob(JB): | 0.20, 1.00, 0.00, 0.00 |
| Heteroskedasticity (H): | 1.71, 1.04, 1.82, 2.32 | Skew: | -0.09, 0.01, 1.40, 0.59 |
| Prob(H) (two-sided): | 0.01, 0.85, 0.00, 0.00 | Kurtosis: | 3.45, 3.00, 9.46, 4.43 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 685.8356 | 24.523 | 27.967 | 0.000 | 637.771 | 733.900 |
| L1.e(CO) | 0.1183 | 0.540 | 0.219 | 0.827 | -0.941 | 1.177 |
| L1.e(HC) | -0.1375 | 0.340 | -0.404 | 0.686 | -0.804 | 0.529 |
| L1.e(NO2) | 0.1514 | 0.236 | 0.641 | 0.521 | -0.311 | 0.614 |
| L1.e(O3) | 0.4098 | 0.214 | 1.917 | 0.055 | -0.009 | 0.829 |
| L2.e(CO) | 0.2265 | 0.643 | 0.352 | 0.725 | -1.034 | 1.487 |
| L2.e(HC) | 0.1381 | 0.429 | 0.322 | 0.747 | -0.702 | 0.978 |
| L2.e(NO2) | 0.1691 | 0.241 | 0.701 | 0.484 | -0.304 | 0.642 |
| L2.e(O3) | 0.0094 | 0.283 | 0.033 | 0.974 | -0.545 | 0.563 |
| L3.e(CO) | 0.5891 | 0.639 | 0.921 | 0.357 | -0.664 | 1.842 |
| L3.e(HC) | 0.1262 | 0.433 | 0.291 | 0.771 | -0.723 | 0.975 |
| L3.e(NO2) | 0.3714 | 0.212 | 1.750 | 0.080 | -0.045 | 0.787 |
| L3.e(O3) | -0.1358 | 0.338 | -0.402 | 0.687 | -0.798 | 0.526 |
| L4.e(CO) | 0.4675 | 0.501 | 0.934 | 0.350 | -0.514 | 1.449 |
| L4.e(HC) | 0.0445 | 0.356 | 0.125 | 0.901 | -0.654 | 0.743 |
| L4.e(NO2) | 0.2668 | 0.166 | 1.609 | 0.108 | -0.058 | 0.592 |
| L4.e(O3) | -0.1475 | 0.251 | -0.588 | 0.557 | -0.639 | 0.344 |
| L5.e(CO) | 0.2988 | 0.449 | 0.666 | 0.506 | -0.581 | 1.179 |
| L5.e(HC) | -0.1417 | 0.312 | -0.454 | 0.650 | -0.754 | 0.470 |
| L5.e(NO2) | 0.0798 | 0.167 | 0.477 | 0.633 | -0.248 | 0.408 |
| L5.e(O3) | -0.0910 | 0.192 | -0.473 | 0.636 | -0.468 | 0.286 |
| L6.e(CO) | 0.4309 | 0.506 | 0.852 | 0.394 | -0.560 | 1.422 |
| L6.e(HC) | -0.2918 | 0.337 | -0.867 | 0.386 | -0.952 | 0.368 |
| L6.e(NO2) | 0.1512 | 0.197 | 0.769 | 0.442 | -0.234 | 0.537 |
| L6.e(O3) | -0.0328 | 0.199 | -0.165 | 0.869 | -0.422 | 0.357 |
| L7.e(CO) | 0.2016 | 0.531 | 0.380 | 0.704 | -0.839 | 1.242 |
| L7.e(HC) | 0.1191 | 0.312 | 0.382 | 0.703 | -0.492 | 0.731 |
| L7.e(NO2) | 0.0889 | 0.192 | 0.462 | 0.644 | -0.288 | 0.465 |
| L7.e(O3) | -0.0792 | 0.191 | -0.414 | 0.679 | -0.454 | 0.295 |
| L8.e(CO) | 0.1436 | 0.420 | 0.342 | 0.733 | -0.680 | 0.967 |
| L8.e(HC) | -0.0345 | 0.245 | -0.141 | 0.888 | -0.514 | 0.445 |
| L8.e(NO2) | -0.0427 | 0.138 | -0.308 | 0.758 | -0.314 | 0.228 |
| L8.e(O3) | -0.0520 | 0.162 | -0.322 | 0.748 | -0.369 | 0.265 |
| beta.T | 0.5701 | 0.743 | 0.767 | 0.443 | -0.887 | 2.027 |
| beta.RH | 3.7081 | 0.624 | 5.939 | 0.000 | 2.484 | 4.932 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 629.5331 | 26.446 | 23.804 | 0.000 | 577.699 | 681.367 |
| L1.e(CO) | -0.4435 | 0.572 | -0.776 | 0.438 | -1.564 | 0.677 |
| L1.e(HC) | 0.3235 | 0.288 | 1.122 | 0.262 | -0.241 | 0.888 |
| L1.e(NO2) | 0.0879 | 0.234 | 0.375 | 0.707 | -0.371 | 0.547 |
| L1.e(O3) | 0.4018 | 0.220 | 1.828 | 0.068 | -0.029 | 0.832 |
| L2.e(CO) | 0.1854 | 0.673 | 0.275 | 0.783 | -1.134 | 1.505 |
| L2.e(HC) | 0.2063 | 0.365 | 0.565 | 0.572 | -0.509 | 0.921 |
| L2.e(NO2) | 0.0294 | 0.238 | 0.124 | 0.901 | -0.436 | 0.495 |
| L2.e(O3) | -0.0907 | 0.255 | -0.356 | 0.722 | -0.590 | 0.409 |
| L3.e(CO) | 0.2131 | 0.720 | 0.296 | 0.767 | -1.198 | 1.624 |
| L3.e(HC) | 0.1861 | 0.346 | 0.538 | 0.591 | -0.492 | 0.864 |
| L3.e(NO2) | 0.1195 | 0.225 | 0.531 | 0.595 | -0.321 | 0.560 |
| L3.e(O3) | -0.1332 | 0.277 | -0.481 | 0.630 | -0.676 | 0.409 |
| L4.e(CO) | 0.2037 | 0.584 | 0.349 | 0.727 | -0.942 | 1.349 |
| L4.e(HC) | 0.0986 | 0.317 | 0.311 | 0.756 | -0.524 | 0.721 |
| L4.e(NO2) | 0.0471 | 0.182 | 0.259 | 0.796 | -0.309 | 0.403 |
| L4.e(O3) | -0.1797 | 0.224 | -0.801 | 0.423 | -0.620 | 0.260 |
| L5.e(CO) | 0.1157 | 0.461 | 0.251 | 0.802 | -0.787 | 1.019 |
| L5.e(HC) | -0.2627 | 0.311 | -0.845 | 0.398 | -0.872 | 0.347 |
| L5.e(NO2) | -0.1032 | 0.171 | -0.605 | 0.545 | -0.438 | 0.231 |
| L5.e(O3) | -0.0431 | 0.204 | -0.211 | 0.833 | -0.443 | 0.357 |
| L6.e(CO) | -0.1173 | 0.554 | -0.212 | 0.832 | -1.203 | 0.968 |
| L6.e(HC) | -0.2897 | 0.338 | -0.856 | 0.392 | -0.953 | 0.374 |
| L6.e(NO2) | 0.0122 | 0.206 | 0.059 | 0.953 | -0.391 | 0.415 |
| L6.e(O3) | 0.1614 | 0.210 | 0.767 | 0.443 | -0.251 | 0.574 |
| L7.e(CO) | -0.3488 | 0.547 | -0.637 | 0.524 | -1.422 | 0.724 |
| L7.e(HC) | 0.5661 | 0.330 | 1.714 | 0.087 | -0.081 | 1.213 |
| L7.e(NO2) | -0.0342 | 0.212 | -0.161 | 0.872 | -0.450 | 0.382 |
| L7.e(O3) | -0.0430 | 0.188 | -0.229 | 0.819 | -0.412 | 0.326 |
| L8.e(CO) | -0.0426 | 0.442 | -0.096 | 0.923 | -0.909 | 0.824 |
| L8.e(HC) | 0.2034 | 0.272 | 0.748 | 0.454 | -0.329 | 0.736 |
| L8.e(NO2) | 0.0090 | 0.163 | 0.055 | 0.956 | -0.311 | 0.329 |
| L8.e(O3) | -0.0173 | 0.183 | -0.095 | 0.924 | -0.376 | 0.341 |
| beta.T | 2.0947 | 0.888 | 2.359 | 0.018 | 0.355 | 3.835 |
| beta.RH | 1.8466 | 0.744 | 2.481 | 0.013 | 0.388 | 3.305 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 602.8451 | 45.475 | 13.257 | 0.000 | 513.716 | 691.974 |
| L1.e(CO) | -0.3639 | 0.747 | -0.487 | 0.626 | -1.827 | 1.099 |
| L1.e(HC) | 0.1831 | 0.561 | 0.326 | 0.744 | -0.917 | 1.283 |
| L1.e(NO2) | 0.2600 | 0.355 | 0.732 | 0.464 | -0.436 | 0.956 |
| L1.e(O3) | -0.2641 | 0.396 | -0.666 | 0.505 | -1.041 | 0.513 |
| L2.e(CO) | -0.6775 | 0.869 | -0.780 | 0.436 | -2.381 | 1.026 |
| L2.e(HC) | -0.4270 | 0.706 | -0.605 | 0.546 | -1.812 | 0.958 |
| L2.e(NO2) | 0.0151 | 0.362 | 0.042 | 0.967 | -0.695 | 0.726 |
| L2.e(O3) | 0.3092 | 0.508 | 0.609 | 0.543 | -0.686 | 1.305 |
| L3.e(CO) | -1.1778 | 1.022 | -1.153 | 0.249 | -3.181 | 0.825 |
| L3.e(HC) | 0.0837 | 0.661 | 0.127 | 0.899 | -1.211 | 1.379 |
| L3.e(NO2) | -0.0619 | 0.306 | -0.202 | 0.840 | -0.661 | 0.537 |
| L3.e(O3) | 0.3483 | 0.497 | 0.701 | 0.483 | -0.626 | 1.322 |
| L4.e(CO) | -0.1544 | 1.061 | -0.146 | 0.884 | -2.233 | 1.924 |
| L4.e(HC) | -0.0861 | 0.651 | -0.132 | 0.895 | -1.362 | 1.189 |
| L4.e(NO2) | 0.1572 | 0.258 | 0.610 | 0.542 | -0.348 | 0.662 |
| L4.e(O3) | 0.1964 | 0.420 | 0.467 | 0.640 | -0.628 | 1.020 |
| L5.e(CO) | 0.3822 | 1.030 | 0.371 | 0.711 | -1.637 | 2.402 |
| L5.e(HC) | 0.0757 | 0.680 | 0.111 | 0.911 | -1.257 | 1.408 |
| L5.e(NO2) | 0.2457 | 0.270 | 0.910 | 0.363 | -0.284 | 0.775 |
| L5.e(O3) | -0.0523 | 0.422 | -0.124 | 0.901 | -0.880 | 0.776 |
| L6.e(CO) | 0.0161 | 0.927 | 0.017 | 0.986 | -1.802 | 1.834 |
| L6.e(HC) | 0.5908 | 0.680 | 0.868 | 0.385 | -0.743 | 1.924 |
| L6.e(NO2) | 0.2821 | 0.294 | 0.961 | 0.337 | -0.293 | 0.858 |
| L6.e(O3) | -0.1233 | 0.438 | -0.281 | 0.778 | -0.982 | 0.736 |
| L7.e(CO) | 0.5599 | 0.786 | 0.713 | 0.476 | -0.980 | 2.100 |
| L7.e(HC) | -0.0624 | 0.609 | -0.102 | 0.918 | -1.257 | 1.132 |
| L7.e(NO2) | 0.2839 | 0.283 | 1.004 | 0.315 | -0.270 | 0.838 |
| L7.e(O3) | -0.1221 | 0.367 | -0.332 | 0.740 | -0.842 | 0.598 |
| L8.e(CO) | 1.5249 | 0.568 | 2.685 | 0.007 | 0.412 | 2.638 |
| L8.e(HC) | -0.2664 | 0.505 | -0.527 | 0.598 | -1.257 | 0.724 |
| L8.e(NO2) | 0.4698 | 0.225 | 2.092 | 0.036 | 0.030 | 0.910 |
| L8.e(O3) | -0.3388 | 0.290 | -1.170 | 0.242 | -0.906 | 0.229 |
| beta.T | 3.4976 | 1.242 | 2.817 | 0.005 | 1.064 | 5.931 |
| beta.RH | 0.1338 | 1.078 | 0.124 | 0.901 | -1.978 | 2.246 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | 601.6244 | 44.607 | 13.487 | 0.000 | 514.196 | 689.053 |
| L1.e(CO) | -0.7557 | 1.092 | -0.692 | 0.489 | -2.896 | 1.384 |
| L1.e(HC) | -0.2431 | 0.570 | -0.426 | 0.670 | -1.360 | 0.874 |
| L1.e(NO2) | 0.2285 | 0.466 | 0.490 | 0.624 | -0.685 | 1.142 |
| L1.e(O3) | 1.2278 | 0.427 | 2.878 | 0.004 | 0.392 | 2.064 |
| L2.e(CO) | 0.4584 | 1.344 | 0.341 | 0.733 | -2.176 | 3.093 |
| L2.e(HC) | 0.1462 | 0.671 | 0.218 | 0.828 | -1.169 | 1.462 |
| L2.e(NO2) | 0.0565 | 0.468 | 0.121 | 0.904 | -0.860 | 0.973 |
| L2.e(O3) | -0.0657 | 0.514 | -0.128 | 0.898 | -1.074 | 0.942 |
| L3.e(CO) | 0.7424 | 1.404 | 0.529 | 0.597 | -2.009 | 3.494 |
| L3.e(HC) | 0.1441 | 0.633 | 0.228 | 0.820 | -1.096 | 1.384 |
| L3.e(NO2) | 0.3306 | 0.429 | 0.771 | 0.441 | -0.510 | 1.171 |
| L3.e(O3) | -0.2404 | 0.536 | -0.449 | 0.654 | -1.291 | 0.810 |
| L4.e(CO) | 0.1604 | 1.105 | 0.145 | 0.885 | -2.006 | 2.326 |
| L4.e(HC) | 0.1482 | 0.618 | 0.240 | 0.810 | -1.063 | 1.359 |
| L4.e(NO2) | 0.0744 | 0.368 | 0.202 | 0.840 | -0.646 | 0.795 |
| L4.e(O3) | -0.2051 | 0.401 | -0.511 | 0.609 | -0.991 | 0.581 |
| L5.e(CO) | 0.2558 | 0.905 | 0.283 | 0.777 | -1.517 | 2.029 |
| L5.e(HC) | -0.4827 | 0.623 | -0.775 | 0.438 | -1.703 | 0.738 |
| L5.e(NO2) | -0.2575 | 0.351 | -0.735 | 0.463 | -0.945 | 0.429 |
| L5.e(O3) | -0.1401 | 0.394 | -0.355 | 0.722 | -0.913 | 0.633 |
| L6.e(CO) | -0.1564 | 1.054 | -0.148 | 0.882 | -2.223 | 1.910 |
| L6.e(HC) | -0.6117 | 0.659 | -0.928 | 0.353 | -1.904 | 0.680 |
| L6.e(NO2) | -0.0288 | 0.406 | -0.071 | 0.943 | -0.824 | 0.767 |
| L6.e(O3) | 0.2725 | 0.421 | 0.647 | 0.517 | -0.552 | 1.097 |
| L7.e(CO) | -0.4347 | 1.055 | -0.412 | 0.680 | -2.503 | 1.634 |
| L7.e(HC) | 0.6924 | 0.633 | 1.093 | 0.274 | -0.549 | 1.934 |
| L7.e(NO2) | 0.0107 | 0.394 | 0.027 | 0.978 | -0.761 | 0.783 |
| L7.e(O3) | -0.0402 | 0.385 | -0.105 | 0.917 | -0.794 | 0.714 |
| L8.e(CO) | -0.1603 | 0.899 | -0.178 | 0.859 | -1.922 | 1.602 |
| L8.e(HC) | 0.1520 | 0.522 | 0.291 | 0.771 | -0.872 | 1.176 |
| L8.e(NO2) | -0.0145 | 0.301 | -0.048 | 0.961 | -0.604 | 0.575 |
| L8.e(O3) | 0.0701 | 0.335 | 0.209 | 0.834 | -0.587 | 0.727 |
| beta.T | -0.8484 | 1.443 | -0.588 | 0.557 | -3.677 | 1.981 |
| beta.RH | 4.7549 | 1.202 | 3.957 | 0.000 | 2.400 | 7.110 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| sqrt.var.CO | 71.2068 | 7.225 | 9.856 | 0.000 | 57.046 | 85.367 |
| sqrt.cov.CO.HC | 79.6116 | 9.198 | 8.655 | 0.000 | 61.583 | 97.640 |
| sqrt.var.HC | 39.6937 | 5.347 | 7.424 | 0.000 | 29.214 | 50.173 |
| sqrt.cov.CO.NO2 | -80.2902 | 15.805 | -5.080 | 0.000 | -111.268 | -49.312 |
| sqrt.cov.HC.NO2 | -14.3333 | 15.876 | -0.903 | 0.367 | -45.449 | 16.782 |
| sqrt.var.NO2 | 67.5224 | 7.847 | 8.605 | 0.000 | 52.142 | 82.903 |
| sqrt.cov.CO.O3 | 144.0027 | 15.541 | 9.266 | 0.000 | 113.543 | 174.463 |
| sqrt.cov.HC.O3 | 33.7115 | 10.067 | 3.349 | 0.001 | 13.981 | 53.442 |
| sqrt.cov.NO2.O3 | -1.0173 | 11.066 | -0.092 | 0.927 | -22.707 | 20.672 |
| sqrt.var.O3 | 47.9515 | 5.222 | 9.182 | 0.000 | 37.716 | 58.187 |
model.plot_diagnostics()
plt.show()
# generate predictions
predictions = model.predict(start=X_test.index[0], end=X_test.index[-1], exog=X_test)
predictions.plot()
plt.title("Predicted Gas Values")
predictions = pd.melt(
predictions, value_vars=["CO", "HC", "NO2", "O3"], value_name="Value"
)
predictions.index.rename("id", inplace=True)
predictions
| id | variable | Value |
|---|---|---|
| 0 | CO | 1014.665093 |
| 1 | CO | 954.341356 |
| 2 | CO | 931.782188 |
| 3 | CO | 893.754513 |
| 4 | CO | 847.458719 |
| ... | ... | ... |
| 247 | O3 | 824.917547 |
| 248 | O3 | 759.103056 |
| 249 | O3 | 741.655289 |
| 250 | O3 | 746.205530 |
| 251 | O3 | 790.700171 |
252 rows × 2 columns
# making a directory if it does not exist
Path("./kaggle").mkdir(parents=True, exist_ok=True)
predictions["Value"].to_csv("./kaggle/varmax07.csv")
print("Exported predictions to .csv")
Exported predictions to .csv
RMSE: 151.68211
We were able to create a forecasting model to predict the amount of air pollutants in open air. With climate change being a constant threat to mankind's survival, we should start keeping the environment clean. However, the model still faces some limitations. It has only been trained and tested on an older set of data points from 2016 to 2017. As technology and human life advance, more pollution sources will appear, and the trend and seasonality of air pollution will change. For the model to be fully deployed, more data needs to be collected and used for training to keep the model accurate.
The difficulty I found in the forecasting task was interpreting the ACF and PACF plots. I feel that the orders read off these plots are not as accurate as the orders selected using cross-validation techniques such as expanding windows.
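The expanding-window selection mentioned above can be sketched with `TimeSeriesSplit`, which is already imported at the top of the notebook; the series below is a toy stand-in for the daily gas readings:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy daily series standing in for the gas readings.
series = np.arange(20)

# TimeSeriesSplit yields expanding training windows: each fold trains
# on all earlier observations and tests on the block that follows, so
# a candidate model order can be scored out-of-sample fold by fold.
tscv = TimeSeriesSplit(n_splits=4)
folds = []
for train_idx, test_idx in tscv.split(series):
    folds.append((len(train_idx), len(test_idx)))
    print(f"train [0..{train_idx.max()}] -> test [{test_idx.min()}..{test_idx.max()}]")
```

Because each training window ends where its test block begins, the splits never leak future observations into the fit, which is what makes this scheme a sounder basis for order selection than reading the ACF/PACF plots alone.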